Section 1: TNS Group Activities


1 Introduction 


The TNS group’s testbed activities included: the development of the VuNet, an ATM-framed local 
distribution system; research on the design of host/network adapters; and the demonstration of 
network-based multimedia applications. 


Some of our achievements in the testbed program include: the deployment of the VuNet infras- 
tructure (encompassing 8 switches, 11 network-based video appliances, 11 host workstations and 
a suite of multimedia applications); demonstration of seamless DAN/LAN/WAN internetworking; 
the development of a number of different host/network adapters; and the investigation of ATM 
end-working and signaling issues. 


Along the way we learned a number of lessons concerning: the distinction between cell switching and 
ATM layer functions; the role of memory in the design of host interfaces and end-working software; 
the striping of ATM transmission facilities; and the traffic implications of computation-intensive 
multimedia applications. 


In the following subsections we describe the research objectives, approach, experimental results and 
lessons learned by members of the TNS group. 


2 Research Objectives 


The TNS group’s activities focused on the following subset of Aurora’s research objectives [5]: 
• The design and evaluation of ATM-framed local distribution systems
• The design, development and evaluation of alternative host interfaces
• The integration of gigabit switching, transmission and multiplexing systems
• The demonstration and evaluation of new paradigms for distributed applications


In accordance with MIT’s portion of the Aurora Statement of Work, we also worked with other 
interested groups to: 


• Develop specifications for the ATM layer and investigate ATM adaptation


The following paragraphs describe the relationship between these goals and the approach taken in 
our research. 


ATM-framed local distribution 


Our work on local distribution focussed on the VuNet, an ATM-based Desk Area Network. The 
VuNet, which was developed as part of the testbed, was deployed in individual offices and lab- 
oratories at MIT and a VuNet node was established at U. Penn. To facilitate VuNet / Sunshine 
integration, a VuNet node was located at Bellcore for a brief period. 


Host Interfaces 


Within the testbed as a whole, several areas of host interface research were investigated by the Aurora
partners and, taken together, these provide a fairly good map of the design space. The MIT/TNS
group investigated a portion of this design space, with particular emphasis on the adapter’s point of
attachment, cost, and use of memory. We developed programmed I/O, DMA-based, and coprocessor-based
adapters and collaborated with Bellcore on the design of the Osiris host interface. We also
developed the Vidboard, an ATM-based camera that is directly attached to the VuNet.


Gigabit switching, transmission and multiplexing 


The MIT/TNS group investigated a number of internetworking and systems integration issues and 
demonstrated the seamless ATM-based internetworking of: the VuNet DAN, the AN2 LAN, 
and the Sunshine WAN. We also collaborated with Bellcore on the design of an ATM-based cell 
processor for use in the port controllers of the Sunshine switch. 


Distributed Applications 


The VuSystem, a separately funded project, is a programming system for the software-based 
processing of audio and video data. Within the context of Aurora, we demonstrated the operation 
of VuSystem applications over the VuNet / Aurora infrastructures. 


ATM and Adaptation Layer Issues 


MIT researchers helped initiate the development of SEAL/AAL5, which has been adopted for a 
wide range of packet-oriented services (frame relay, IP, etc.). During the course of our research 
we investigated a number of issues related to ATM end-working and an application-oriented 
approach to signaling. 


3 Approach 


The following sections provide details concerning the approach, design and implementation of key 
technologies developed within the project. 


3.1 Local Distribution Systems - the VuNet 


A Desk-Area Network (DAN), as illustrated in Figure 1, is a small local system in which peripherals, 
such as multimedia devices, are taken out of the workstation and connected directly to the network. 
Workstations access these devices over the network and coordinate the movement of information 
streams among the different parts of the system. Aurora’s DAN implementation is known as the 
VuNet and the Vidboard, described in section 3.3, is an example of a DAN-enabled peripheral. 


The VuNet is a gigabit-per-second desk/local-area ATM network which interconnects general-purpose
workstations, network-based multimedia devices and bridges to other networks. The DAN approach
replaces the workstation’s haphazard collection of peripheral interfaces and connectors (e.g. SCSI, 
VGA, EISA) with a generic ATM-based connector that supports a heterogeneous range of periph- 
erals including cameras, displays, disks, etc. We envisage a vigorous market for such peripherals 
which would be able to operate with any manufacturer’s platform. VuNet peripherals are highly 
programmable and are connected directly to the VuNet, allowing them to be shared among all hosts, 
including hosts not local to the peripheral’s desk area. 


With small networks, it is possible to rethink the ways certain network design issues are addressed. 
Many of these issues also arise in the design of Local-Area Networks (LANs), Metropolitan-Area 
Networks (MANs), and Wide-Area Networks (WANs); however, their characterization within the
DAN environment differs from that in the LAN or WAN. For example, small desk area networks
need not be designed for high utilization, as traffic aggregation is not a principal design objective.


Figure 1: A Desk Area Network where devices are taken out of the workstation and attached directly
to the network. The workstations coordinate the flow of data from various devices on the network
to other devices.


Larger networks can provide efficient communications because large numbers of hosts share the 
communications medium. In a small working area environment, there is a smaller number of hosts, 
making it more difficult to depend on statistics in sharing the aggregate bandwidth necessary for 
efficient operation. The fact that bandwidth is less expensive over the short distances involved 
suggests a VuNet design regime that leverages inexpensive, lightly utilized bandwidth. 


Similarly, shared wide area components, such as the line card serving a heavily multiplexed link, 
can be relatively expensive. However, in a small desk area network, careful attention must be paid 
to the cost per end node. Hence it becomes attractive to trade off functionality for cost in the non- 
multiplexed components such as the end system switch ports and interfaces. Simple, inexpensive 
interfaces become an important design objective. 


Within Aurora, both regions were explored. Bellcore’s Sunshine effort explored the wide area, and 
the VuNet explored the desk area. 


A key goal motivating the design of the VuNet hardware was simplicity. Sophisticated network 
functions (e.g. multi-cast, back-pressure, support for ATM adaptation layers, service classes) were 
pushed to the edge of the network and became the responsibility of the clients. We believe that a 
local environment such as the VuNet can be effectively served by a simple switch fabric having a 
limited number of access ports and an internal speed that is greater than that of the clients. 


The remainder of this section describes the VuNet design and the hardware that was built. The 
considerations that drove the design were: 


Software intensive philosophy: Wherever possible, the VuNet design favors software-based solu- 
tions that reduce the complexity of network hardware by moving functionality into the software 
of the end nodes. An important benefit of this software-intensive approach is the ability to port 
the system to higher-performance workstations easily. Over the span of the Aurora project, 
four different hardware platforms were used with two varieties of host interfaces, for a factor 
of ten difference in performance. 


Bursty traffic: Since host processes running within an operating system generate network traffic 
during their time slices, their generation of data is inherently bursty. Similarly, they consume 
data in bursts that are not synchronized to the incoming traffic. Accordingly, sources and 
sinks of video and audio information must be able to handle data in bursts and the DAN and
its hosts and peripherals must support bursty traffic.


Asynchronous client interface: Clients should not be slaved to the network clock. Instead, the 
network ports should support various input and output rates. ATM, with its small cell size 
and variable cell rates, can accommodate this requirement. 


Simple client interface: Fast and slow clients alike should see the network as a place to easily 
write and read ATM cells. The VuNet uses a modified cell format in which the cell length is 
increased from 53 to 56 bytes in order to make cells quadword-aligned. 


Data transparency: All classes of traffic (video, audio, file transfer, etc.) are treated in a similar 
manner until they reach the target application, and multiple video streams are easily supported 
since hardware limitations are not placed on their number. 


Interoperability and modularity: Multimedia peripherals such as a video capture board were 
designed to interface directly to the VuNet rather than to a workstation I/O bus. This allows 
the peripherals to be shared among all the hosts connected to the VuNet. It also avoids 
redesigning a peripheral for each vendor’s computer platform. 


These principles were applied to all aspects of the design of the local ATM distribution system. 


3.1.1 The VuNet Cell Switch 


In the development of the VuNet we separated the switching of cells (between switch ports) from 
ATM level functions such as VCI mapping. We started with a simple model of the “ideal” exchange 
of cells between two hosts that are directly connected to each other over a communications channel 
of zero length and infinite speed. An important observation is that even this channel requires a 
memory-based buffer or FIFO of some sort, if only to decouple the clock rates, and ultimately the 
application burst rates, of the communicating processes. 


To satisfy our “ideal” model of cell transfer, the switch fabric, which consisted of bidirectional ports 
with first-in first-out (FIFO) buffers which fed a crossbar matrix, allowed each host to accept cells 
at a rate that exceeds the aggregate rate at which cells are presented by all of the attached hosts. 
Current versions of the VuNet switch have either four or six ports and we clock the crossbar matrix 
at a rate of 700 Mbps per port. Given the small number of ports and the rate at which our present 
hosts are able to generate traffic, this provides a close approximation to the model. 


The switch provides a straightforward interface that simplifies the design of client hardware. Since 
the timing of the client side of the port FIFOs is decoupled from the internal timing of the switch 
matrix, there is no need for clients to synchronize their operation with the internal switch clock, as 
is often the case with traditional switch designs. Furthermore, both the transmit and receive blocks 
can be operated concurrently. The FIFOs also serve to buffer cell bursts, an important factor in 
our simplified network. Finally, the data bus width to the port is selectable, either 32 or 64 bits, 
allowing a simple mapping to workstation and processor buses. All these factors make it easy to 
design devices which connect directly to the switch, satisfying the “simple client interface” objective. 


The VuNet was designed around a non-standard 56-byte ATM cell, as the switches had 64-bit interfaces.
While this presented new issues when bridging to other networks (as discussed in Section 3.4),
it greatly simplified the design of the switch and the network-based devices.


3.1.2 Cell Relaying 


The preceding section described cell switching among devices that are attached to a single switch. 
Inter-switch communication can be modeled by a device (or set of devices) that is connected to ports 
on two different switches. This cell relaying host performs two functions: it copies cells between
the switches and it re-maps VCIs. The relay serves as an intermediary with knowledge of the VCI
address spaces associated with the switch ports.


Figure 2: The six port VuNet switch.


Figure 3: The links in VuNet perform header remapping and next output port lookup for hop-by-hop
cell routing.


The VuNet link implements a minimal set of relaying functions: cell copying and VCI mapping. To
support the inter-office separation of switches, it is constructed in two parts that are inter-connected 
by high speed channels as illustrated in Figure 3. The channels use single mode optics and serial 
coding based on the HP G-Link chipset and the links presently operate at 500 Mbps. Link processing 
is table driven, and the management of the link tables is left to the communicating end systems. 
This will be discussed in further detail in Section 3.5. 


The link tables are large enough to remap the entire 64K (16-bit) VCI space to 64K new VCIs 
and four bits of switch port information. All other bits in the ATM header remain untouched. 
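

The table-driven remapping described above can be sketched in software. The following is a minimal model, not the link hardware: the class and method names are hypothetical, and the identity mapping used to initialise the table is an arbitrary choice made only so the sketch runs. The only facts taken from the text are the 16-bit VCI space and the four bits of port information per entry.

```python
from dataclasses import dataclass

VCI_BITS = 16                 # the text states a 64K (16-bit) VCI space
PORT_BITS = 4                 # four bits of switch port information per entry
TABLE_SIZE = 1 << VCI_BITS


@dataclass(frozen=True)
class Entry:
    new_vci: int              # VCI written into the outgoing cell header
    out_port: int             # switch port selected for the next hop


class LinkTable:
    """Hypothetical software model of one link's header lookup table."""

    def __init__(self) -> None:
        # Assumption: identity mapping to port 0 as a placeholder default.
        self._entries = [Entry(vci, 0) for vci in range(TABLE_SIZE)]

    def write(self, vci: int, new_vci: int, out_port: int) -> None:
        """Install one mapping, as the end systems would via table management."""
        if not (0 <= vci < TABLE_SIZE and 0 <= new_vci < TABLE_SIZE):
            raise ValueError("VCI does not fit in 16 bits")
        if not 0 <= out_port < (1 << PORT_BITS):
            raise ValueError("port does not fit in four bits")
        self._entries[vci] = Entry(new_vci, out_port)

    def relay(self, vci: int) -> Entry:
        """Return the replacement VCI and output port for an incoming cell."""
        return self._entries[vci]


table = LinkTable()
table.write(vci=0x0042, new_vci=0x1234, out_port=3)
print(table.relay(0x0042))

# Storage needed if each entry were packed into exactly 16 + 4 bits.
packed_bytes = TABLE_SIZE * (VCI_BITS + PORT_BITS) // 8
print(packed_bytes, "bytes")
```

A full-size table indexed directly by the incoming VCI means a lookup is a single memory read, which is consistent with the simple, hop-by-hop link processing described above.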


The VuNet does not support the Header Error Check (HEC) function on a hop-by-hop basis as we 
do not believe that HEC is an essential function for desk area or even local area operation. For 
compatibility purposes a cell source can pre-compute the HEC that is expected by the destination. 
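

As an illustration of how a source might pre-compute the HEC, the sketch below implements the standard ATM header checksum: a CRC-8 with generator x^8 + x^2 + x + 1 over the first four header octets, XORed with the pattern 01010101. It assumes the destination expects this standard HEC over a standard four-octet header prefix; the function name is ours and the layout of the VuNet's 56-byte cell is not assumed.

```python
def atm_hec(header4: bytes) -> int:
    """Standard ATM HEC for the first four header octets."""
    if len(header4) != 4:
        raise ValueError("the HEC covers exactly four header octets")
    crc = 0
    for byte in header4:
        crc ^= byte
        for _ in range(8):
            # Shift left one bit; apply the generator (0x07) when a 1 falls out.
            if crc & 0x80:
                crc = ((crc << 1) ^ 0x07) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    # The coset pattern 0x55 is added to the CRC remainder.
    return crc ^ 0x55


print(hex(atm_hec(bytes([0x00, 0x00, 0x00, 0x01]))))
```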


Management of the link tables is performed using control cells sent with a reserved ATM VCI. Cells
received by the link on these special VCIs cause the link to re-write entries in its header lookup
table. They can also cause the link to emit a cell containing table entries, so link tables can be read
back.


3.2 Host Interfaces 


Many researchers believe that host interface performance can be improved through the off-loading
of network protocol processing, i.e., moving functionality from the host to the interface. Within
Aurora, several different types of host interfaces were developed to explore this hypothesis. For the 
VuNet, several types of interfaces were designed. MIT also participated with Bellcore in the design 
of their Osiris host interface. 


Using these various interfaces, different methods of segmentation and reassembly were investigated; 
this spanned the gamut from specialized on-board host interface hardware to extremely simple 
host interface hardware combined with software segmentation and reassembly. Different methods 
of host-interface synchronization were used including interrupt driven handling and device polling. 
The various interfaces also explored different network points of attachment. 


The simplest VuNet interface was a programmed I/O interface in which the processor was responsible 
for writing data to the interface. Another version supported DMA with variable burst sizes from one 
ATM cell size and up. In both cases, the segmentation and reassembly was performed by the host 
processor. An ATM-based coprocessor interface was also designed and fabricated. This interface, 
which resembled a floating point unit, connected directly to the coprocessor port of the CPU. 


Finally, MIT participated in the design of the Bellcore Osiris interface, a custom segmentation/reassembly 
hardware engine attached via the processor I/O bus. In the Osiris design large packets are transferred 
to the interface, which segments the packets before transmission into the ATM network. 


The following subsections describe the design of these host interfaces. 


3.2.1 VuNet Programmed I/O Interfaces 


All of the VuNet host interfaces rely on the CPU to perform the ATM segmentation and reassembly 
(SAR) tasks. Although this approach was computationally intensive, it provided a flexible envi- 
ronment for the study of many host interface issues. Other Aurora partners, working in parallel, 
investigated other aspects of the design space, especially hardware-based SAR. 


The first VuNet interface designed was the VuNet Programmed I/O interface, known as the
VupBoard. This host interface served only as a memory-mapped extension of the read port of
the switch output FIFOs. Cells were read into memory through 14 reads of the memory address 
corresponding to the FIFO. Other addresses corresponded to control and status indicators such as 
the state of the FIFOs. 


Similarly, sending cells into the network required making 14 writes to memory, interleaved with 
writes to the address corresponding to the FIFO write control signals. 
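

The figure of 14 accesses per cell follows from the cell size: a 56-byte cell moved over a 32-bit data path takes 56 / 4 = 14 word transfers. The sketch below models this with a software queue standing in for the memory-mapped FIFO address. The class and function names are hypothetical, the big-endian word order is an arbitrary choice, and the control-signal writes mentioned above are not modelled.

```python
import struct
from collections import deque

CELL_BYTES = 56                               # VuNet cell length from the text
WORD_BYTES = 4                                # assumes the 32-bit port width
WORDS_PER_CELL = CELL_BYTES // WORD_BYTES     # 14 reads or writes per cell


class FifoPort:
    """Stand-in for the memory-mapped FIFO address (hypothetical)."""

    def __init__(self) -> None:
        self._words = deque()

    def write_word(self, word: int) -> None:
        self._words.append(word & 0xFFFFFFFF)

    def read_word(self) -> int:
        return self._words.popleft()

    def cells_available(self) -> int:
        return len(self._words) // WORDS_PER_CELL


def send_cell(port: FifoPort, cell: bytes) -> None:
    """Write one cell as 14 successive 32-bit words."""
    if len(cell) != CELL_BYTES:
        raise ValueError("VuNet cells are 56 bytes long")
    for word in struct.unpack(">14I", cell):
        port.write_word(word)


def receive_cell(port: FifoPort) -> bytes:
    """Read one cell as 14 successive 32-bit words."""
    words = [port.read_word() for _ in range(WORDS_PER_CELL)]
    return struct.pack(">14I", *words)


port = FifoPort()
cell = bytes(range(CELL_BYTES))
send_cell(port, cell)
received = receive_cell(port)
```

With the 64-bit port width the same cell would take seven transfers, which is the quadword alignment that motivated the 56-byte format.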


While better methods of utilizing the processor were later investigated, this initial interface provided 
an early development platform that allowed us to explore software segmentation and reassembly of 
ATM cells and interrupt-driven packet processing. Using the VupBoard we were able to develop and 
debug device driver software that made the VuNet accessible to applications via the standard UNIX 
socket services. The driver also made the VuNet accessible via standard IP services, including NFS,
Telnet, FTP, etc. Finally, since the programming model for the VupBoard was similar to that of the
cell coprocessor (though an order of magnitude slower), it provided a vehicle through which we could
benchmark software and project the expected performance of a system with a coprocessor interface.


Figure 4: A block diagram of the VudBoard, which consists of 11 chips.


3.2.2 VuNet DMA Interfaces


The next VuNet-specific interface focused on improving the bandwidth and relaxing the temporal
coupling between the host CPU and the interface: building interfaces that are faster, rather than
smarter.


Although the design does not use any custom ATM circuitry, it does utilize the TURBOchannel
Interface ASIC (TcIA)¹ for the DEC TURBOchannel bus. This bus is used in the DEC 3000 line
of Alpha workstations, as well as in the DEC 5000 line of MIPS-based systems. The TcIA performs
the DMA request and arbitration, and contains two 60-byte FIFOs for receive and transmit.


As shown in Figure 4, the single-chip PAL controls all data movement between the ATM network
and TURBOchannel ASIC. Since the data bus from the VudBoard to the VuNet switch is shared
by the receive and transmit paths, the PAL arbitrates between reading cells from the switch and writing
cells to the switch, with reception given priority over transmission.


The counter is used to implement an optional transmission timer that controls the pace of cells
injected into the network.² Finally, the ROM provides device identification information. Although it
is superfluous to our design, it is included on the interface in order to conform to the TURBOchannel
specification.
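

To give a feel for how a transmission timer paces cells, the sketch below computes how many timer ticks must separate successive 56-byte cells so that the average rate does not exceed a target. The timer clock frequency and the function name are assumptions made for illustration; the text does not state how the VudBoard counter is clocked or programmed.

```python
CELL_BITS = 56 * 8      # one 56-byte VuNet cell


def pacing_interval_ticks(target_bps: int, timer_hz: int) -> int:
    """Ticks to wait between cell starts so the mean rate stays at or below target_bps.

    Integer arithmetic with ceiling division avoids floating-point rounding.
    """
    if target_bps <= 0 or timer_hz <= 0:
        raise ValueError("rates must be positive")
    return (CELL_BITS * timer_hz + target_bps - 1) // target_bps


# Hypothetical 25 MHz timer clock, receiver limited to 100 Mbps.
ticks = pacing_interval_ticks(target_bps=100_000_000, timer_hz=25_000_000)
print(ticks)
```

Rounding the interval up rather than down errs on the side of the slower receiver, which matches the stated purpose of protecting hosts that cannot accept bursts at the full transmission rate.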


Various hardware and software configurations allowed experimentation with interrupt-driven ver- 
sus polled packet-processing. This allowed measurements of packet delivery latency and packet 
processing/context switching overhead. 


3.2.3 Cell Coprocessor Interface


A cell-based coprocessor chip was designed to provide a direct interface between an ATM network
and the coprocessor interface of a conventional RISC processor. The combined RISC processor/cell
coprocessor complex could form the core of an ATM-compatible workstation or could be used as
a stand-alone cell processor, similar in function to Bellcore’s cell processing engine. To perform
network operations, such as reading and writing cells, the RISC processor executes coprocessor
instructions, much the way it performs floating point operations.


¹ The TcIA chip has been discontinued and is no longer available.
² This could be used as an experimental way of implementing flow control, and is currently used for implementing
pacing should a target host not have the ability to receive bursts of data at the full rate which can be transmitted.


Figure 5: Cell-Based Coprocessor Design


From a software perspective, this interface was quite similar to our programmed I/O interfaces.
However, simulations indicated that a considerable performance improvement could be realized by
completely bypassing the memory subsystem and permitting the direct transfer of cell data to/from the
CPU’s registers.


Figure 5 is a block diagram of the prototype coprocessor which is designed to operate with the 40 MHz
versions of the MIPS R3000 processor. This work was closely aligned with Bellcore’s work on a stand- 
alone cell processing engine. The bus interface, timing logic, instruction decoder, and control/status 
registers have been substituted for Bellcore’s on-chip RISC engine and memory interface. The 
instruction decoder includes a pipeline follower that tracks the MIPS instruction stream, decoding 
and sequencing any instructions that pertain to the coprocessor. A large fraction of the design, 
including the network interface, cell buffers and register file has been directly copied from the 
Bellcore chip. The savings resulting from the substantial re-use of chip design and layout is a clear 
demonstration of the benefits of the close collaborative links established within Aurora. 


3.2.4 High Performance Host Interface 


One of the other interfaces designed within Aurora was the Bellcore Osiris board [7], designed with 
the assistance of MIT. This was a hybrid system with the ability to scatter ATM cell payloads into 
host buffers corresponding to VCIs. Though some overhead processing is needed to maintain and 
set up areas of memory for different packets that come in, no more processing is required than in 
transferring fully assembled packets from the host interface.


Figure 6: Block diagram of the Vidboard


3.2.5 Summary 


Host interface functionality can be categorized into three primary functions: data movement, data 
processing, and data delivery. Data movement functions include reading data from processor mem- 
ory and writing the data to the network, and vice versa, while data processing functions include 
demultiplexing data, performing checksums, packet reassembly, and packet segmentation. Data 
delivery to the application level often entails working within the methods used in the operating
system, which may necessitate copying from operating system memory space to application memory 
space. 


Functions such as packet segmentation and reassembly (SAR) are some of the more complex func- 
tions. The complex interface approach would be to perform this processing on the interface itself 
through the use of customized hardware and/or a dedicated CPU. In a reduced complexity interface
(RCI) these functions are performed by the host CPU itself, i.e., on the shared and general purpose
processing engine of the workstation.³


Scattering individual cell payloads into different areas of host memory is an attractive mix between 
hardware and software SAR. Scatter reassembly is a function that does not require a complex design, 
yet may save some per packet assembly costs. Essentially the only difference between this hybrid 
method and outboard packet reassembly is that the bus transfers occur in ATM-sized chunks in 
the former, and packet-sized chunks in the latter. The trade-off involves a reduction in memory 
requirements and latency in exchange for an increase in bus arbitration overhead. 


3.3 Information Appliances - the Vidboard


Multimedia systems typically consist of workstations that contain a number of multimedia add- 
in boards for functions such as video and audio capture and playback. Our DAN equivalent, an 
ATM-based video capture board, known as the Vidboard, was used to explore the properties of a
network-based video source. 


The Vidboard is based on a front-end frame-memory processor architecture that is capable of gen- 
erating full-motion video streams having a range of presentation (picture size, color space, etc.) and 
network (traffic, transport, etc.) characteristics. The architecture is novel in that it also permits the 
decoupling of video from the real-time constraints of the television world. Through a closed-loop 
control mechanism, a destination device can dynamically vary the frame rate of a video stream 
during the course of a session, which allows easier integration of video into the software environment 
of computer systems. 
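

The effect of destination-driven frame requests can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the Vidboard's actual behaviour: the source is assumed to capture at a fixed period, the destination is assumed to request its next frame only after it has finished processing the previous one, and network and transfer delays are ignored. All names and numbers are hypothetical.

```python
def delivered_frame_times(capture_period_us: int,
                          processing_us: int,
                          duration_us: int) -> list[int]:
    """Capture timestamps of the frames that reach the destination.

    Each request is served by the first frame captured at or after the
    request time; frames captured while the destination is busy are skipped.
    """
    delivered = []
    request_time = 0
    while True:
        # Next capture instant at or after the request (ceiling division).
        frame_time = -(-request_time // capture_period_us) * capture_period_us
        if frame_time >= duration_us:
            break
        delivered.append(frame_time)
        request_time = frame_time + processing_us
    return delivered


# One second of capture at a hypothetical 25 frames/s (40 ms period).
fast_host = delivered_frame_times(40_000, processing_us=10_000, duration_us=1_000_000)
slow_host = delivered_frame_times(40_000, processing_us=100_000, duration_us=1_000_000)
print(len(fast_host), len(slow_host))
```

In this model a lightly loaded destination receives every captured frame, while a busy one automatically receives fewer, without any cells queuing up for it in the network.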


³ These concepts are developed further in [10].


Figure 7: Block diagram of the AVlink


3.4 Seamless ATM — DAN/LAN/WAN Internetworking 


One of the important ideas of ATM is that of “seamless” interconnection; that is, the concept of a 
uniform network service that transcends local and wide areas. Although standardized SONET/ATM 
represents one approach, ATM-based networks which do not adhere fully to the SONET standards 
and conventions can still be incorporated into the “seamless” network. The use of the ATM protocol 
reference model identifies differences between two ATM networks which can then be converged 
through relatively simple hardware. To investigate these issues we developed two specialized VuNet
links, the AVlink and the Zebra.


3.4.1 AVlink 


The AVlink is an ATM bridge interconnecting the VuNet and the Aurora testbed wide-area facilities. 
Differing design considerations for each network have caused a number of architectural differences 
between them. The AVlink overcomes these differences and provides low latency interconnection 
between the two networks. Performance numbers and several configurations in which the AVlink
has been used are described in Section 4.1.2.


On the VuNet side, the AVlink is similar to other inter-switch links. On the Aurora side, the AVlink 
attaches to one of the four available OC-3c channels via Bellcore’s STS-3/OC-12 multiplexor. The 
use of multiple channels in parallel, a technique known as striping, is complicated by the presence 
of varying end-to-end delays between the channels. This effect, sometimes known as skew, was also 
investigated in the development of the Bellcore Osiris board. 


3.4.2 Zebra — Striping 


The Zebra is an ATM bridge interconnecting the VuNet to the AN2 switch, an industrial strength 
ATM LAN developed by Digital. The AN2 can, in turn, be connected to the Aurora facilities via a 
four channel OC-3 card and an OC-3/OC-12 multiplexor developed by Bellcore. 


In order to formalize the issues associated with network striping, we developed a reference model 
that distinguishes four degrees of freedom of relevance to striping implementations. These are: 


• Striping Topology: How and where the path between two hosts is split into stripes. Possible
cases are end-to-end splitting, internal splitting, or some combination of these two.


• Participation of Hosts and Networks: The degree to which the hosts and networks support a
striping implementation.


• Striping Layer: The network layer at which the striping occurs. This can range from the
physical layer all the way up to the application layer, and determines the size of the striping
unit, which is the largest unit of data which is transported intact on a single stripe.


• Implementation Issues: There are two main problems to overcome when implementing network
striping; these are balancing the loads on the stripes, and maintaining synchronization across
the stripes so that higher layer data may be correctly reassembled. The implementation issues
relate to the various options available for solving these problems and the means by which the
characteristics of the striped facilities are made visible to the layer at which striping is being
performed.


Figure 8: Environment developed for striping experiments.
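

The load-balancing and synchronization problems named in the last item can be made concrete with a toy model. The sketch below uses round-robin assignment of striping units to balance load and an explicit sequence number to restore order at the receiver when the stripes have unequal delays (skew). Both choices are illustrative assumptions, one of several possible options, and are not a description of what the Zebra or the AVlink implements.

```python
import heapq


def stripe(units, n_stripes):
    """Round-robin assignment; returns (sequence number, stripe index, unit)."""
    return [(seq, seq % n_stripes, unit) for seq, unit in enumerate(units)]


def deliver(striped, stripe_delay, send_interval=1):
    """Arrival order at the receiver when stripe i adds stripe_delay[i] time units."""
    arrivals = []
    for seq, s, unit in striped:
        arrival_time = seq * send_interval + stripe_delay[s]
        arrivals.append((arrival_time, seq, unit))
    arrivals.sort()
    return [(seq, unit) for _, seq, unit in arrivals]


def resequence(arrivals):
    """Release units in sequence order, buffering any that arrive early."""
    pending, out, expected = [], [], 0
    for seq, unit in arrivals:
        heapq.heappush(pending, (seq, unit))
        while pending and pending[0][0] == expected:
            out.append(heapq.heappop(pending)[1])
            expected += 1
    return out


units = list("ABCDEFGH")
arrived = deliver(stripe(units, n_stripes=4), stripe_delay=[0, 5, 1, 9])
arrival_order = "".join(unit for _, unit in arrived)
restored = resequence(arrived)
print(arrival_order, "".join(restored))
```

The size of the resequencing buffer grows with the skew between the fastest and slowest stripe, which is one reason the choice of striping layer and striping unit matters.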


An environment that interconnected the VuNet DAN, the AN2 LAN, and the Sunshine WAN using
the Aurora facilities was developed (Figure 8) to explore and experiment with the functional space
mapped out by the model. As part of this environment, a bridge between the VuNet and AN2 was 
developed. Known as the Zebra, this board converges the differences between the two networks. The 
configuration linking these three networks, as well as some lessons learned about network striping, 
will be presented in a later section. 


3.5 ATM End Working 


Using the VuNet as a research platform, several aspects of service integration were investigated. 
First, VuNet network communications were fully integrated into the system, directly via ATM-supported
UNIX sockets, and indirectly, via IP-supported interfaces (including UDP and TCP sockets).
This allowed all the usual Internet services, e.g., FTP, Telnet, NFS, etc. to be offered over the
VuNet.


Also, the Vidboard network-based video capture device was used as a platform for real-time traffic
generation within the network. Source traffic shaping allowed investigation into real-time traffic 
interaction with the host device drivers. A closed-loop control mechanism was developed for graceful 
degradation of real-time traffic bursts. 


Investigations into the jitter arising from the middle and upper layers within the end systems of 
multi-service networks were also conducted. 


3.5.1 A UNIX Device Driver with Software SAR 


Our initial requirements of data transparency, interoperability and modularity dictated the structure
of our network device driver. The driver, as shown in Figure 9, was organized into
several major levels. At the highest level are the application processes, which receive data over the
UNIX socket interface. Below this are the kernel side of the socket interface level, the reassembly
and protocol processing level, and the buffer and device management level.


Figure 9: Cells are read from the network and passed to the software reassembly process at regular
polled intervals. These assembled packets are then passed to the IP handler or directly to a VuNet
socket.



Figure 10: Video source control model: 1) A command cell is sent to the Vidboard, starting it in a
2) continuous video capture loop, where, upon 3) receipt of a frame request cell from the workstation,
a 4) video frame is delivered.


If cells have been received on the network, the driver temporarily halts the VudBoard in order to 
point it to a new kernel buffer, freeing up the other buffer to be queued for reassembly. Buffers were 
allocated and quickly swapped in order to minimize the time the host interface is “deaf.” Because 
of the size of the output buffers on the switch, the driver had to be designed so that cells were not 
lost while this was occurring. Thus, the VudBoard was restarted before the buffers which contained 
network data were processed. 


The buffers that are queued by the previous level may contain interleaved ATM cells from many
different VCIs. The reassembly process must look up the VCI to determine the type of connection,
and reassemble the cell payloads into packets, stripping out the ATM header information and other 
related information. This is done by copying the data into different buffers reserved for reassembled 
packets. 
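The reassembly step described above can be sketched as follows. This is a minimal illustration, not the VuNet driver code. It assumes the standard 5-byte ATM UNI header with a 48-byte payload, and it assumes (in the style of AAL5) that one payload-type bit marks the last cell of a packet; the report does not say how the driver detected packet boundaries. The function names are hypothetical.

```python
from collections import defaultdict

CELL_SIZE = 53     # standard ATM cell: 5-byte header + 48-byte payload
HEADER_SIZE = 5


def make_cell(vci, payload, last=False):
    """Build a 53-byte cell; GFC, VPI, CLP and HEC are left at zero."""
    assert len(payload) == CELL_SIZE - HEADER_SIZE
    pt = 1 if last else 0                      # assumed end-of-packet marker
    header = bytes([0,
                    (vci >> 12) & 0x0F,
                    (vci >> 4) & 0xFF,
                    ((vci & 0x0F) << 4) | (pt << 1),
                    0])
    return header + payload


def parse_vci_and_last(cell):
    """Extract the 16-bit VCI and the assumed end-of-packet bit."""
    vci = ((cell[1] & 0x0F) << 12) | (cell[2] << 4) | (cell[3] >> 4)
    last = bool((cell[3] >> 1) & 0x01)
    return vci, last


def reassemble(buffer):
    """Demultiplex interleaved cells by VCI; return completed (vci, packet) pairs."""
    partial = defaultdict(bytearray)           # one reassembly buffer per VCI
    done = []
    for off in range(0, len(buffer), CELL_SIZE):
        cell = buffer[off:off + CELL_SIZE]
        vci, last = parse_vci_and_last(cell)
        partial[vci] += cell[HEADER_SIZE:]     # copy payload, strip header
        if last:
            done.append((vci, bytes(partial.pop(vci))))
    return done
```

Note that each payload is copied once into a per-VCI buffer, which mirrors the copying cost the text attributes to software reassembly.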


Depending on the kind of packet reassembled, packets are passed directly to the VuNet application 
socket, or may be passed to the IP Packet Handler for additional protocol processing and further 
demultiplexing. The UNIX socket mechanism is used as the data delivery mechanism for the appli- 
cations that use our network data. Once the data is delivered to the socket level, it can be copied out 
to user space on user reads, completing the delivery of data to the application. Integrating UNIX 
sockets allowed the seamless integration of the network to the application level. 


Clearly, our software-based SAR approach involves considerable copying of network data, and this
is an issue to which we will return in the lessons section of this report.


3.5.2 Vidboard Network Protocols 


Hosts in the VuNet are not dedicated to processing and forwarding network data - they are user
machines under varying loads. In order to prevent the backlog and loss of cells while hosts are busy
with user tasks, a protocol was developed for the use of network-based peripherals, such as the
Vidboard.


The command cell protocol is used in making the Vidboard execute a particular task. The protocol 
consists of a single ATM cell, formatted into a number of fields, that carries command
information to the Vidboard. The command code field indicates the task to be executed. Additional
information about the task is carried in the command parameters field.


As shown in Figure 10, cells are usually generated in response to a command cell. Other examples 
of responses are status information in response to a diagnostic command and a video stream in 
response to a video data request. The response to a video request has two types of characteristics: 
video adaptation layer and video traffic shaping. 


For sending video data from the Vidboard to a workstation, we adopted an adaptation layer similar to 
that specified by the ATM Adaptation Layer 5 (AAL5) standard, except that the cyclical redundancy 
check (CRC) value for error-detection has been replaced by a more easily computed checksum value. 
In this protocol, scan lines are grouped into packets which are then fragmented into ATM cells. The 
size of the packet is user-programmable. 
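As an illustration of this style of adaptation layer, the sketch below pads a packet, appends an 8-byte trailer and cuts the result into 48-byte cell payloads, as AAL5 does. The trailer layout (two unused bytes, a 16-bit length, a 32-bit check value) follows AAL5; the checksum itself, a 32-bit sum of bytes, is only a stand-in assumption, since the report does not specify which checksum the Vidboard used.

```python
import struct

PAYLOAD = 48   # payload bytes per ATM cell
TRAILER = 8    # AAL5-style trailer size in bytes


def checksum32(data):
    """Hypothetical stand-in for the Vidboard checksum: 32-bit sum of bytes."""
    return sum(data) & 0xFFFFFFFF


def segment(packet):
    """Pad packet plus trailer to a multiple of 48 bytes and split into payloads."""
    pad = (-(len(packet) + TRAILER)) % PAYLOAD
    body = packet + bytes(pad)
    # Trailer: 2 unused bytes, 16-bit length, 32-bit checksum in place of the CRC.
    trailer = struct.pack(">HHI", 0, len(packet), checksum32(body))
    pdu = body + trailer
    return [pdu[i:i + PAYLOAD] for i in range(0, len(pdu), PAYLOAD)]


def reassemble(payloads):
    """Join cell payloads, verify the checksum and strip padding and trailer."""
    pdu = b"".join(payloads)
    _, length, check = struct.unpack(">HHI", pdu[-TRAILER:])
    if checksum32(pdu[:-TRAILER]) != check:
        raise ValueError("checksum mismatch")
    return pdu[:length]
```

For example, a packet holding two 640-byte scan lines (1280 bytes) needs 8 bytes of padding plus the trailer and occupies 27 cells.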


The Vidboard is capable of generating video faster than some workstations can process it. In order 
to avoid overwhelming the workstation network interface with video data, the Vidboard implements 
traffic shaping. Video is sent in small bursts separated by delays, during which the burst is processed 
by the workstation. The workstation requesting the video sets four parameters in the video request 
command cell to tune the video stream’s traffic characteristics to its needs. 


As part of the protocols for controlling the Vidboard from across a network, a scheme was developed 
which dynamically varied the Vidboard’s frame rate during a video session. The Vidboard is placed 
in a constant video capture loop and is sent requests for frames as they are needed. The application 
can vary the rate of requests to match its ability to process video. As an example, a video-in-a- 
window application gracefully halves its rate of requests as a second video-in-a-window is started 
and the two applications have to share the available machine cycles. 
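The halving behaviour can be illustrated with a very simple model. The sketch assumes that an application requests its next frame only after it has finished processing the previous one, and that concurrently running applications share the processor equally; both the model and the 50 ms per-frame cost used in the example are illustrative assumptions, not measurements from the report.

```python
def achieved_frame_rate(camera_fps, cpu_seconds_per_frame, active_apps):
    """Frames per second obtained by one application under closed-loop requests."""
    # Assumption: each application receives an equal share of the processor,
    # so its effective per-frame cost grows with the number of active applications.
    effective_cost = cpu_seconds_per_frame * active_apps
    # The request rate can never exceed the rate of the camera signal.
    return min(camera_fps, 1.0 / effective_cost)


# One video window alone, then the same window sharing the machine with a second one.
alone = achieved_frame_rate(30.0, 0.05, 1)
shared = achieved_frame_rate(30.0, 0.05, 2)
```

Because the Vidboard only answers requests, no frames are queued or dropped in the network when the application slows down; the rate simply follows the consumer.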


3.6 Application-oriented Signaling 


Each application in the VuNet is responsible for opening, maintaining, and closing its own connec- 
tions. This is done in a “wormhole” fashion by way of ATM control cells embedded in the cell 
data stream. Applications use library functions that access a shared file and execute an allocation 
algorithm that prevents nodes from stealing other nodes’ VCIs. Background processes can be used
to verify link tables and refresh connections when necessary, such as in the case where a link card 
has been power cycled. In order to maintain table consistency when the network is reconfigured, it 
is possible to run topology daemons that allow each host to discover the network topology. 


3.7 The VuSystem 


The VuSystem[14, 15] is a programming system for the software-based processing of audio and 
video data. The VuSystem was designed to run on ordinary Unix workstations which have no 
specific support for the manipulation of multimedia data. Because the VuSystem includes an easy 
to program and extensible in-band processing component, it is uniquely suited to the development 
of applications that perform intelligent processing of live media. 


VuSystem applications [13] combine intelligent media processing with traditional capture and display. 
Some process live video for more responsive human-computer interaction. Others digest television 
broadcasts in support of content-based retrieval. Both classes demonstrate the utility of network- 
based multimedia systems that deliver audio and video data all the way to the application. 


Examples of some VuSystem applications include: 


The Room Monitor processes video data from a stationary camera in a room. It processes the 
live video to determine if the room is occupied or empty, and records video frames only when 
activity is detected above some threshold. It produces a file containing a series of video clips 
that summarize the activity in the room. A video browser is used to view the segments.


The Whiteboard Recorder keeps a history of changes to an office whiteboard. It works by taking
video from a stationary camera aimed at the whiteboard and filtering it. By following a simple
set of rules, the filtering distills the video into a minimum set of images. A browser (Figure
11) can be used to view the saved images.


Figure 11: The whiteboard program.


Figure 12: The sports browser.


The News Browser provides interactive access to television news articles. CNN Headline News is 
automatically partitioned into segments and saved on disk at regular intervals. The stories are 
viewed with a video browser program. News stories that are closed-captioned can be retrieved 
based on their content. 


The Joke Browser records late-night talk show monologues, and segments them into jokes by pro- 
cessing the closed-captioned text. It extracts information from a recorded monologue through 
the analysis of closed-captions. In addition to the text of the jokes, the captions contain hints 
to the presence of audience laughter and applause. A joke parsing module groups captions 
into jokes. 


The Sports Highlight Browser records and segments a nightly sporting news telecast into a set 
of video clips. Each clip represents highlights of a particular sporting event, and the clips can 
be browsed as shown in Figure 12. This application generates its annotations through the 
template matching of scoreboard graphics. 


While many VuSystem applications do not require a high speed ATM network in order to function, 
they were easily layered on top of the VuNet. Performance of the applications using VuNet based 
devices equaled or exceeded performance using specialized video and audio capture devices. 


4 Experiments and Results 


4.1 The VuNet 


Eight VuNet nodes were deployed at MIT, Bellcore and the University of Pennsylvania over a period 
of three years. Each VuNet node consisted of a four or six port switch equipped with a link (or 
AVlink), one to three Vidboards, and one or two host interfaces. The nodes at MIT were typically 
connected in a ring, with a stub protruding from the ring which connected, through Aurora and 
two AVlinks, to a similarly-outfitted VuNet node located at Bellcore in Morristown, NJ, then at the
University of Pennsylvania, as shown in Figure 13. ViewStation applications required no modification
to operate over Aurora. 


In total, 8 switches, 11 network-based video appliances, 11 host workstations and a suite of multi- 
media applications were deployed to demonstrate seamless DAN/LAN/WAN internetworking. 


The following subsections describe some results achieved with the VuNet hardware, using as the 
primary example, the Vidboard client connected to the VuNet running the VuSystem software. 


4.1.1 Local VuNet Performance 
Each switch functioned at a speed of 700 Mbps per port, per direction, for an aggregate capacity of
2.8 Gbps for the four-port switches and 4.2 Gbps for the six-port switches. Individual switches were
run at over 1.5 Gbps per port, for an aggregate capacity of 9.0 Gbps, though this speed was not
used in deployed switches. Within the local ATM ring, the inter-switch links operated at 500 Mbps.
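The aggregate figures follow directly from the port count and the per-port rate. The short check below reproduces them; the helper name is hypothetical, and reading the 9.0 Gbps figure as six ports at 1.5 Gbps is an inference from the numbers rather than something the text states.

```python
def aggregate_gbps(ports, per_port_mbps):
    """Aggregate switch capacity in Gbps for a given port count and port rate."""
    return ports * per_port_mbps / 1000.0


four_port = aggregate_gbps(4, 700)     # deployed four-port switches
six_port = aggregate_gbps(6, 700)      # deployed six-port switches
fast_six = aggregate_gbps(6, 1500)     # consistent with the 9.0 Gbps figure
```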


4.1.2 AVlink/Zebra Performance 


Seamless interconnection really works. Despite the lack of a common physical layer standard, the
integration of the different ATM networks was performed quite easily; cells from the VuNet were
routed via the Sunshine switch and accepted by the VuNet node on the other end. The only remaining
issue was connection management across the separate conjoined networks. The VuNet used cell-based
signaling whereas the Sunshine switch used out-of-band connection management. However, this was
bridged by allowing Sunshine to switch the VuNet connection management cells.


Figure 13: Wide-area configuration.


In a separate experiment, the AN2 switch was attached to the VuNet as a stub off the main ring. 
On another port of the AN2 switch was a connection to the Aurora/SONET facilities, using an AN2 
line card which generated 4 OC-3c channels and a Bellcore multiplexor board which generated an 
OC-12 channel from four OC-3c channels and vice-versa. Since only one AN2 switch was available, 
two hosts on the local VuNet ring were connected using a loopback topology. Software was written 
for the line card processor which allowed it to support the in-band connection management used 
by the VuNet. Using this configuration, VuNet hosts were able to set up connections through each 
of the OC-3c channels, which they could use as stripes. Code could then be written to implement 
various striping schemes outlined by our reference model. 
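As one example of what such code might look like, the sketch below implements plain round-robin striping of cells across a set of channels. It is an assumption chosen for illustration, not one of the schemes from the reference model, and it ignores skew: it presumes each channel delivers its cells in order and that the receiver can wait for the slowest channel.

```python
def stripe(cells, channels):
    """Distribute cells over the channels in round-robin order."""
    stripes = [[] for _ in range(channels)]
    for index, cell in enumerate(cells):
        stripes[index % channels].append(cell)
    return stripes


def merge(stripes):
    """Rebuild the original cell order by reading the channels in the same rotation."""
    merged = []
    rounds = max((len(s) for s in stripes), default=0)
    for r in range(rounds):
        for s in stripes:
            if r < len(s):
                merged.append(s[r])
    return merged


# Ten cells striped across four OC-3c channels and merged again.
cells = list(range(10))
stripes = stripe(cells, 4)
restored = merge(stripes)
```

A scheme that tolerates skew between channels would need extra information, such as sequence numbers or marker cells, which this sketch does not attempt.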


4.2 Host/Network Adapters 


The work on the host adapters spanned four generations of host workstations. Initially, the VuNet
hosts were DEC 5000/200 (25 MHz) and later 5000/240 (40 MHz) workstations based on the MIPS
R3000 processors. During the middle of the project, all the workstations were upgraded to 133 MHz
DEC 3000/400 Alpha workstations. At the end of the project, one 200 MHz DEC 3000/800 Alpha
workstation was acquired and connected to the VuNet.


4.2.1 VupBoard


Using the VuSystem software, the performance of the VupBoard was measured using video generated
by a Vidboard. The VupBoard driver was very processor-intensive, as it required
TURBOchannel bus reads to get data from the network.


While this contributed to the processor load, and hence to slow interface speed when reading from
and writing to the network, the primary limitation of the VupBoard was its interrupt-driven
design. Interrupt service latencies were not guaranteed in any way, so data could be lost in the
buffers of the switch when the device could not be serviced while large amounts of data arrived.


Initially, with the DEC 5000, 10 Mbps of data was delivered to the application level. Upon upgrading 
to the DEC 3000/400, 15 Mbps of data was delivered to the application. As the VupBoard was phased 
out soon after introduction of the VudBoard, no numbers are available for the DEC 3000/800. 


4.2.2 VudBoard


Using an application which reads and writes data to VuNet sockets along with the standard VuSystem 
applications, we tested the performance of the current VudBoard. 


When a packet is queued for transmission, the packet segmentation occurs before the resulting page 
can be moved cell-by-cell over the TURBOchannel to the VudBoard. The peak data transmission 
of a single page occurs at 232 Mbps. However, since we can only DMA one page at a time (due to 
a bug in the DEC ASIC used), we must process an interrupt for each page to be transmitted. The 
performance achieved for transmitting many pages is 67 Mbps. On the receive side, 512 Kbit bursts 
at 232 Mbps can easily be processed by the VudBoard. The bottleneck on the receive side, however,
is the host processing. 


When data arrives at the host interface, it must contend for the TURBOchannel bus to transfer its 
data into host memory. Due to bus arbitration, grant latency, and a wait period between successive 
transfers, the host interface is unable to write to memory at a rate greater than 232 Mb/sec. However, 
since the host cannot process data at this rate, it cannot be sustained indefinitely. The maximum 
sustained rate at which data could be transferred across the bus was 131 Mb/sec. 


When host processing is factored in and the data is reassembled and delivered to the application 
level, we find that we can indefinitely sustain a data rate of 100.7 Mb/sec of assembled data packets 
delivered to the application memory space. The device driver must process and reassemble raw ATM 
data at 111 Mb/sec in order to sustain this packet data rate to the application level. The
data transfer rate over the TURBOchannel bus is actually 134 Mb/sec, but this reflects transfer of
64 bytes per ATM cell.
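These three rates are related by the ratio of bytes moved per cell to payload bytes per cell. The check below assumes 48 payload bytes per cell, 53 bytes for a raw ATM cell and 64 bytes per cell on the bus; the 53-byte interpretation of the 111 Mb/sec figure is an inference that happens to match the reported numbers, not something the text states.

```python
def raw_rate(payload_mbps, bytes_per_cell, payload_bytes=48):
    """Scale a payload data rate up to the rate of the cells that carry it."""
    return payload_mbps * bytes_per_cell / payload_bytes


atm_rate = raw_rate(100.7, 53)   # raw ATM cells: about 111 Mb/sec
bus_rate = raw_rate(100.7, 64)   # 64 bytes per cell on the bus: about 134 Mb/sec
```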


Performance of the VudBoard is summarized in Table 1, along with VupBoard performance numbers. 
Throughput measurements are made while running the full suite of ViewStation software within the 
UNIX operating system. 


Data delivered to the application level
(Megabits per second)

Interface              | PIO  | DMA  | DMA/Improved Driver
DEC 5000/200           | 10.4 | 15.4 |
DEC 3000/400 (133 MHz) | 14.7 | 34.4 | 66
DEC 3000/800 (200 MHz) |      |      | 100.7


Table 1: VudBoard performance measured by data delivered to application level.


4.2.3 Coprocessor Simulations


A coprocessor chip for the VuNet was designed and fabricated. Though it was not integrated into
a MIPS-based workstation, it was partially debugged to the extent that cell registers could be read
and written using our logic analyzer/signal generator. A full suite of simulations was performed,
and the following describes our experiments and results.


To measure the throughput attainable by the coprocessor host interface, we performed timing exper- 
iments on a floating point coprocessor. The processes’ view of the coprocessor interface essentially 
consists of coprocessor general purpose registers and coprocessor control registers. These are avail- 
able on a floating point coprocessor as well, and since the timing of all coprocessors (whether floating 
point or network interface) with respect to the main processor in the R3000 architecture is the same, 
the results we obtained can be expected with the network interface cell coprocessor also. 


The workstation configurations evaluated were a DECstation 5000/200 and a DECstation 5000/240. 
We ran a series of experiments in which the Interface Model, Transfer Direction, Cell Oper- 
ations, and Processor Speed were independently varied. The results are summarized in Tables 3 
and 4. Some of the simulations present optimistic results that provide an upper bound on the
burst performance that could be achieved by the traditional interfaces. For example, our DMA
values capture the data transfer rate between memory and the registers. Thus, we assumed that the
DMA transfer occurs with little interference to the processor, and that the rate of this DMA transfer
is much higher than the rate at which data can be accessed by the processor.


Name | Function
Vup | Programmed I/O VupBoard interface. For receive routines, this consists of loads from TURBOchannel to processor registers, followed by stores into memory. For transmit, it consists of loads from memory to the processor registers, followed by stores to the TURBOchannel.
(name illegible) | Same as Vup, except with no header byte swap.
Cop1 | Similar to Vup, using the coprocessor approach.
Cop2 | For receive, move header from coprocessor to processor registers. Then store body directly from coprocessor register file to memory. For transmit, load header into CPU registers, and load body directly from memory to coprocessor registers.
Cop3 | In the receive direction, store entire cell directly into memory. For transmit, load entire cell directly from main memory to coprocessor registers.
Cop4 | Same as Cop1, except with no header byte swap.
Cop5 | Same as Cop2, except with no header byte swap.
Cop6 | Same as Cop3, except with no header byte swap.
DMA | In the receive mode, load data that has already been stored into main memory by DMA. In the transmit mode, store data into main memory to be read by the network through DMA.
DMA1 | Same as DMA, except with no header byte swap.


Table 2: Interface models

Table 3: Peak Throughput (with header byte swap)


Table 4: Peak Throughput (without header byte swap)


• As expected, the simulated numbers predicted a throughput higher than observed. In the
programmed I/O path, the receive and transmit throughput suggested by the index numbers
are 89 Mbps and 166 Mbps. This can largely be accounted for by the fact that we assumed
that TURBOchannel writes and reads took 3 and 8 cycles respectively. Some logic analyzer
experiments conducted by Chris Lindblad and Dave Tennenhouse indicate that the read
performance on the TURBOchannel is 14 Turbo-cycles. This changes our predicted value in the
receive direction to 48 Mbps, which is much closer to the observed value. Unfortunately, no
experimental values are available for the TURBOchannel write direction.


The observed throughputs in the DMA7 case are 290 Mbps and 196 Mbps, in the receive and
transmit directions, respectively. These values take into account the entire path of data, all 
the way from the network DMA device, to the correct location in memory. The Bellcore 
Osiris board [7] runs at 480 Mbps and 384 Mbps. However, these figures are for the direct 
path between the DMA device and memory, and do not take into account the copying of this 
data into some final location in memory. Our predicted throughput figures (294 Mbps and 
268 Mbps), are limited by the memory subsystem, rather than the throughput between the 
memory and DMA device. Thus these provide an upper bound on the overall performance 
achievable, if a DMA board, such as the Bellcore board, were used. Even if the DMA board 
could transfer at higher rates into memory, the total throughput would be bound by the 
performance of the memory system. 


[Table 5 body is largely unrecoverable; surviving entries: row labels Receive and Transmit, and the value 103 in the Transmit row]


Table 5: Predicted vs. Observed Throughput (in Mbps)


4.3 Vidboard Performance


The high speed of the VuNet and the properties of its client interface ensure that the traffic
characteristics of a data stream injected at one point in the network are preserved when the stream
arrives at its destination. This property was exploited to tailor the traffic characteristics of a video
stream generated by the Vidboard to the needs of the workstation receiving it.


Our software environment processes multimedia information in small bursts. Video frames are 
segmented into transmission frames (t-frames) which are transmitted in bursts separated by delays. 
A pseudo-AAL5 segmentation protocol [16] is used to ship the t-frames. The stream is shaped using 
the following parameters: number of scan lines per t-frame (LT), number of cells per burst (CB), 
interburst delay and interframe delay. These parameters are used to shape the video stream to 
match the burst absorption timing properties of the workstation. These properties are related to 
the latencies of the interrupts and routines involved in processing a burst of data. 
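To show how the four shaping parameters interact, the sketch below computes the number of cells and bursts in one video frame and the resulting frame period. It is a simplified model under stated assumptions: the per-t-frame trailer is ignored, bursts do not span t-frames, the number of scan lines is a multiple of LT, one interburst delay follows every burst, and the cell time and delay values in the example are arbitrary. None of these details are given in the report.

```python
CELL_PAYLOAD = 48   # payload bytes per ATM cell


def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def shaped_frame(scan_lines, bytes_per_line, lt, cb,
                 cell_time_us, interburst_us, interframe_us):
    """Return (total cells, total bursts, microseconds per video frame)."""
    t_frames = ceil_div(scan_lines, lt)
    cells_per_t_frame = ceil_div(lt * bytes_per_line, CELL_PAYLOAD)
    bursts_per_t_frame = ceil_div(cells_per_t_frame, cb)
    cells = t_frames * cells_per_t_frame
    bursts = t_frames * bursts_per_t_frame
    # Sending time for all cells, one delay after each burst, one delay per frame.
    period_us = cells * cell_time_us + bursts * interburst_us + interframe_us
    return cells, bursts, period_us


# 640x480 8-bit video, LT = 4 scan lines per t-frame, CB = 14 cells per burst.
example = shaped_frame(480, 640, 4, 14,
                       cell_time_us=1, interburst_us=100, interframe_us=1000)
```

In this model, raising CB or LT lowers the number of bursts and t-frames per video frame, which is the lever the workstation uses to trade interrupt overhead against burst size.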


Frame rate (frames/s) by scan lines per t-frame (LT) and cells per burst (CB); only one row survives:

Cells per burst | Scan lines per t-frame
                | 4   | 16  | 120 | 480
14              | 7.7 | 8.7 | 9.1 | 9.5


Table 6: Effects of traffic shaping on frame rate


Picture size | Black and white video | Dithered color video
             | frames/s (Mbits/s)    | frames/s (Mbits/s)
640x480      | 15.0 (37)             | 4.3 (10.5)
320x240      | 30.0 (23)             | 15.0 (9.2)
160x120      | 30.0 (4.6)            | 30.0 (4.6)


Table 7: Single Vidboard to Alpha frame rates.


The effects of traffic shaping on the frame rate performance of a video display application are reported 
in Table 6. The type of video stream used is 640x480 8-bit monochrome and the workstation used 
is a DEC 5000/200 using a VupBoard. The table shows the frame rate achieved for different values 
of the CB and LT parameters. For each pair of values, the delays are found that result in the best 
frame rate. The results show the intuitive conclusion that larger bursts and t-frames correspond to 
higher frame rates. Since the workstation takes a single interrupt to read in a burst, larger burst sizes 
mean less interrupt overhead over the course of a frame. The benefits of larger burst sizes decrease 
as the bursts become larger because the interrupt overhead becomes insignificant in comparison to 
the time spent reading the burst during the interrupt. Larger t-frame sizes increase the frame rate 
because less t-frame processing occurs. 


Using the protocols described in Section 3.5.2, experiments have been conducted to determine the 
video frame rate which could be achieved between one or more Vidboards and an Alpha workstation 
displaying video with audio. Different types of video experience bottlenecks in different parts of 
the system: the Vidboard, at any resolution and depth, is limited by the 30 frame per second rate 
of the camera signal; at full resolution, the relative timing of the current host software and the 
Vidboard limits the performance of a single stream to less than the full frame rate; eight-bit color
video streams are limited by the Vidboard’s dithering firmware; and with twenty-four-bit color video,
the bottleneck is a host software module that dithers the pixels for display on 8-bit screens.


Figure 14: Processing Times of Representative Filter Modules.


Table 8: Rates of VuSystem Applications.


The aggregate video rate to a single workstation, achieved with concurrently active Vidboards, was
about 74 Mb/s. This is equivalent to full resolution (640x480) 8-bit black and white video at 30
frames per second displayed on the host display. The limiting factor was the Alpha workstation: the
segmentation, display and other processing required consumed the processor. In the case of color
streams, dithering was the limiting factor.
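The 74 Mb/s figure can be checked from the frame geometry. The helper below (a hypothetical name) computes the uncompressed pixel rate only, with no cell or protocol overhead.

```python
def video_mbps(width, height, bits_per_pixel, frames_per_second):
    """Uncompressed video data rate in megabits per second."""
    return width * height * bits_per_pixel * frames_per_second / 1e6


full_rate = video_mbps(640, 480, 8, 30)    # full resolution at 30 frames/s
half_rate = video_mbps(640, 480, 8, 15)    # full resolution at 15 frames/s
small_rate = video_mbps(160, 120, 8, 30)   # 160x120 at 30 frames/s
```

The three results, roughly 73.7, 36.9 and 4.6 Mb/s, agree with the aggregate figure above and with the 640x480 and 160x120 black and white entries in Table 7.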


4.4 VuSystem 


Our experience with the VuSystem has revealed it to be a good platform for the investigation of 
concrete ways that computers may become more responsive to their human users. Users of the 
VuSystem have developed applications that assist users by performing various tasks that require the 
analysis of live video [13]. 


The VuSystem has also been used to explore the potential of media processing applications to support 
content-based retrieval of pre-recorded television broadcasts. VuSystem-based media browsers use 
textual annotations that represent recognizable events in the video stream. These annotations are 
analyzed and processed to create higher level representations that may be meaningful to a human 
user. 


We performed a series of throughput experiments to measure the processing rates of the Exercise 
and Edge filtering modules. The Exercise module walks over each pixel in each video frame to 
simulate a variety of processing operations. It was chosen as a baseline. The Edge module detects
and highlights edges within each frame. It was chosen as an example of an algorithm we use in
practice. Both modules were written in C/C++ and not painstakingly optimized; comparable rates 
may be expected from the straightforward coding of other modules. 


Both the DEC 3000/400 and SPARCstation 10/512 were found to be able to process video at up to 
15 Mpixels per second, regardless of frame size. They may perform edge detection at better than 
live rates for 320 x 240 x 8 bit frames. This shows that visual processing of live video in software is 
viable. We expect further improvements with future advances in workstation technology. 


Module     | DEC 3000/400   | SPARC 10/512
           | ms/frame (fps) | ms/frame (fps)
Exercise   |                |
Edge       |                |
WindowSink |                |
FileSource |                |
Vidboard   |                |


Table 9: System Throughput


The rate at which video may be passed to the X server for display was measured with a WindowSink 
module. The FileSource module was used to measure how quickly video could be sourced from a 
local SCSI-II disk. The Vidboard module was used to read live video via a VuNet [11] interface from 
a Vidboard [1] video capture peripheral. This last module was available only on the DEC 3000/400s. 
Table 9 shows the observed rates for 320 x 240 x 8 bit frames, and includes rates for the Exercise 
and Edge modules for comparison. 


The WindowSink module was shown to be capable of processing video at over 100 frames per 
second. Video can be passed to a windowing system, in this case the X Window System with the 
XShm shared memory extension, and displayed without writing directly to the frame buffer. On 
the multiprocessor SPARCstation 10/512, the operating system was able to achieve a considerable 
speed-up over a single processor SPARCstation, presumably by running the X server process in 
parallel with the video application. 


The FileSource and Vidboard modules were observed to process video at approximately 30 frames 
per second. This is a lower rate than that for the WindowSink and Exercise/Edge modules, but the 
figures are less conclusive. 


The Vidboard result exceeded the frame rate of live video and is primarily bounded by the structure 
of the network interface. It again serves as a lower bound.


The rates of Exercise and Edge filtering modules, from 50-200 frames per second, compared favorably 
with those of the sources and sinks. We consider this a positive result for our approach, since 
the incremental cost of processing video does not dominate. Applications such as live video edge 
detection and display run at approximately 15 frames per second. 


VuSystem module-level dispatching is non-preemptive, and depends on the Unix process scheduler for 
dispatching operations. To verify that this scheduling system is adequate for many perceptual-time 
processing tasks, we measured the exact time that Timeout member functions run. We instrumented 
the system to compare the actual time at which a Timeout is called to the time for which it was 
scheduled. If the VuSystem performs well, the difference between the actual time and the scheduled 
time will be slight. We recorded this Timeout precision for runs of the vsdemo program. Four 
runs were taken, each indicating Timeout precision under different system loads, created by running 
multiple concurrent vsdemo processes. 
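The precision summary used in this experiment can be computed from a list of lateness values, one per Timeout call. The sketch below is a hypothetical helper, not VuSystem code, and the sample data in the example is made up purely to show the calculation.

```python
def cumulative_within(lateness_ms, thresholds_ms=(5, 10, 15, 20)):
    """Percentage of calls whose lateness is at or below each threshold."""
    if not lateness_ms:
        raise ValueError("no Timeout samples recorded")
    total = len(lateness_ms)
    return [100.0 * sum(1 for late in lateness_ms if late <= limit) / total
            for limit in thresholds_ms]


# Made-up lateness samples (actual call time minus scheduled time), in milliseconds.
samples = [0, 1, 2, 3, 7, 12, 18, 40]
summary = cumulative_within(samples)
```

Each entry of the result is cumulative, so the values never decrease from one threshold to the next.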


Figure 15 shows a histogram of the percentage of Timeout calls, as a function of the number of 
milliseconds after the scheduled time that Timeout was called. It has been truncated at 20ms. 




[Figure 15 plot: percentage of Timeout calls (vertical axis, 0 to 90) against milliseconds from scheduled time (horizontal axis, 0 to 20), with one series each for one, two, three and four vsdemo processes]


Figure 15: Timeout Precision Histogram. 


one process | two processes | three processes | four processes
98.9%       | 89.2%         | 86.4%           | 69.8%
99.9%       | 96.0%         | 91.5%           | 77.6%


Table 10: Timeout Precision.


These measurements were made on the Digital DEC 3000/400, where the precision of the time-of- 
day clock is one millisecond. 


Table 10 indicates the cumulative percentage of Timeout calls made within 5ms, 10ms, 15ms, and 
20ms of target, as a function of the number of multimedia processes running on the Digital DEC 
3000/400. The chart shows that almost all Timeout calls were made well within the universally
accepted 15 ms of jitter introduced by the “3-2 pulldown” process used to convert motion pictures to
NTSC video. The chart also shows that the precision of Timeout calls degrades gracefully with increased
system load.


5 Lessons Learned and Concepts Demonstrated 


The TNS group’s testbed activities covered the development of the VuNet, an ATM-framed local 
distribution system; research on the design of host/network adapters; and the demonstration of 
network-based multimedia applications. 


Through deployment of the VuNet infrastructure, the seamless integration of the wide-area Aurora 
facilities, and the use of the VuSystem software, we were able to demonstrate a wide variety of 
concepts and draw lessons concerning: 


• The distinction between cell switching and ATM layer functions
• The striping of ATM transmission facilities for seamless integration
• The integration of ATM-based Information Appliances
• The host interface design space
• The traffic implications of computation-intensive multimedia applications


5.1 Cell Switching and Desk Area Networking


By simplifying the switch design and integrating certain functions into links and end nodes, we 
effectively separated the cell switching functions from the ATM level cell relaying functions. This 
was an important distinction which may prove useful in a number of ways. 


Minimal hardware functionality in the switch is an important consideration for a local area ATM 
network. The switches need not perform complex functions such as those incorporated into the 
Sunshine switch. In a desk area, only a small number of switch ports need full ATM functionality. 
For example, serial to parallel conversion and VCI mapping are not required by directly attached 
devices; they are needed only at switch ports supporting inter-switch links. 


Although current ATM standards intertwine cell relaying and ATM level functions, we suggest that 
the relevant standards bodies (ATM Forum, ITU-T) consider dividing the present ATM layer into cell
switching and ATM functions. The former could then be used as the basis of a universal I/O system 
based on cell switching. Such an approach would unify the haphazard and expensive collection of 
connectors found on the backs of existing PCs and workstations. While some of the devices attached 
to such an I/O system might be fully ATM compliant, others, such as keyboards, mice, etc., may only 
implement the cell level functions. These non-ATM devices could use the features of cell switching 
without the complexity of VCI mapping, etc. 


5.2 Seamless ATM and Striping 


The AVlink demonstrated that one advantage of using ATM in both the local- and wide-area net- 
works is the simplicity of interconnecting them. The AVlink only needs to make minor cell changes 
at the ATM and physical layers, which can be done as the cells pass from VuNet to Aurora. The 
design of the AVlink allows the LAN/WAN boundary to appear seamless to clients on either side of 
the interface. Hence, applications which worked within the local area worked just as well over the 
wide area. 


The effort to solve the skew problem caused by the SONET facilities highlights striping as an 
important architectural issue. There is an innovation mismatch between local area networks and the 
wide area facilities provided by the telephone carriers. Telephone companies will only upgrade their 
facilities when the aggregate demand of their customers is large enough to recover their investment. 
Local area networks cost much less to upgrade and maintain than the wide area facilities. Aggressive 
users of LANs who want a leading edge wide area connection find themselves in a position where 
they are unable to get the amount of bandwidth they need or want at a cost that is acceptable. 
This coupling of technology in the two domains is a barrier to innovation and migration to new 
equipment. Network striping is a method to decouple the selective adoption of technology in the 
local area from its wholesale deployment in the wide area. 


Striping at the ATM layer is particularly attractive as it permits us to implement an end-to-end 
striping scheme but still derive the low level benefits of using a small, fixed-size striping unit. Striping 
at a higher layer reduces the range of upper layer protocols and applications that benefit. Striping at 
lower layers prevents us from implementing an end-to-end solution. Whereas most previous striping 
efforts have focussed on providing increased bandwidth for all traffic over a single hop, there is a 
need for end-to-end striping solutions that can provide increased bandwidth to the aggressive users 
who require it and that are not dependent on the deployment of striping support at nodes within 
the network. End-to-end methods that are (responsive to but) independent of the characteristics of 
the underlying physical layer will adapt as the equipment in the network is gradually upgraded. 
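
As a rough illustration of end-to-end striping with a small, fixed-size unit, the Python sketch below
deals cells round-robin onto several links and restores their order at the receiver. The explicit
sequence tag, the function names and the way skew is modeled are assumptions made for this example;
they do not describe the scheme used in the testbed.

```python
def stripe(cells, num_links):
    """Deal cells round-robin onto num_links, tagging each with a sequence number."""
    links = [[] for _ in range(num_links)]
    for seq, cell in enumerate(cells):
        links[seq % num_links].append((seq, cell))
    return links


def resequence(arrivals):
    """Rebuild the original cell order from (seq, cell) pairs received in any order."""
    return [cell for _, cell in sorted(arrivals, key=lambda pair: pair[0])]


cells = [f"cell-{i}" for i in range(8)]
links = stripe(cells, num_links=4)
# Model skew: whole links are delivered out of order relative to one another.
skewed_arrivals = links[2] + links[0] + links[3] + links[1]
assert resequence(skewed_arrivals) == cells
```

Because both functions run only at the end points, nothing in this sketch depends on striping support
at nodes inside the network.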


5.3 ATM-based Information Appliances 


A number of important conclusions, concerning the properties of a network-based video capture 
module, can be drawn from the Vidboard research. The ability to temporally decouple video from 
its television source combined with a closed-loop control model facilitates the integration of video into 
the virtual-time computing world. Video service can be adapted to the available system resources 
and degrades gracefully as resources become scarce. A second conclusion is that a capture module 
needs to be able to accomplish tasks related to the distributed nature of the environment and not 
only capture/transmit video. Task examples are network transport, traffic shaping and distributed 
control. 


With the trend towards integration of video into the digital domain, research groups are devel- 
oping cameras which produce video in a digital format [18] that will come to replace television 
camera/capture board systems for computer video applications. Similarly, with the trend towards 
distributed computing, video capture systems which connect directly to a network are becoming 
desirable. A digital network camera would consist of a lens, a CCD array, and circuitry for network 
interfacing, distributed control and frame buffering. The Vidboard architecture could provide the 
basis for this camera of the future. 


5.4 The Host Interface Design Space 


A thorough approach to host interface design can be taken by answering five questions—the division 
of protocol processing between the host and the interface board, the method of data transfer, the 
role of memory, the point of attachment to the host’s memory hierarchy, and the synchronization 
means by which the processor and interface coordinate their actions. Most of the previous work on 
host interfacing, including comparative studies of different approaches, has focussed on one of these 
issues at a time. Furthermore, there has been relatively little work concerning the role of memory. 


The above questions should be answered within the context of a specific operating environment. 
Specifically, it is important to consider whether or not the processor is expected to touch all the 
words of the network data. For example, a file server may move data (via its memory) between its 
disk and network interfaces without moving the data to or from the processor registers or caches. 
Similarly, applications receiving network data are often blocked, leaving the processor available for 
protocol processing, whereas a server might have other clients, not waiting on the incoming data, 
that could make better use of the available cycles. 


Although the TNS group’s activities considered many aspects of interface design, we made a conscious 
effort to avoid the issue of hardware-based segmentation and reassembly (SAR). Other groups 
within Aurora (Bellcore, Penn, Arizona) were actively investigating this question and we did not 
want to duplicate their efforts. Accordingly, our results and lessons should be viewed as comple- 
mentary to theirs, rather than competitive. 


5.4.1 Protocol Processing and Data Transfer 


Using the VudBoard, we explored the design and use of a simple DMA interface to a gigabit network. 
The primary functions of the interface and the driver software are to move, process and deliver 
data. While our host-based SAR approach involves considerable copying and processing on the 
part of the host, we find that reduced complexity interfaces (RCT) can deliver good performance at 
reasonable cost. 


Changes to our driver’s API, such as Arizona’s FBUFs and Penn’s buffer strategy, could be used to 
avoid one of the copies. An important question is whether or not to adopt a hardware-based SAR 
approach to avoid the other copy. The answer to this question will depend on a number of factors, 
including cost and the environment in which the host will be used. In many cases, especially server 
environments, we would opt for the reduction in memory copy traffic made possible by our partners’ 
work on hardware-based SAR. 


Figure 16: The memory distributed throughout the various parts of the system can be thought of 
as a large virtual buffer. 


Nonetheless, the VudBoard’s simple approach, even without an FBUF-style API, achieves a level 
of performance that is compatible with commercially available (OC-3) switches. Furthermore, the 
relative performance of memory subsystems, which had been rather stagnant, is now improving, 
partly in response to lessons learned within the testbeds. As memory subsystems continue to improve 
— driving down the cost of the SAR-imposed memory copy — gigabit rate host SAR may become 
viable, possibly by the time OC-12 switches are widely available. 


5.4.2 The Role of Memory 


The design of the VudBoard and its associated device driver demonstrates an important lesson 
concerning the role of memory. Since the buffers of the switch port are limited to a small number 
(256) of cells, the interface must maintain a nearly continuous flow of cells into host memory. In 
practice, the switch and host buffers are concatenated together by the interface and can be thought 
of as one large “virtual buffer,” as shown in Figure 16. 


The interface and device driver are designed to preclude lapses that would cause cells to be lost at 
the network boundary. Interestingly enough, it was this consideration rather than raw throughput 
that drove us towards a DMA based design — one in which the transfer of cells would continue while 
the processor was performing other tasks. The only time that cell transfer is interrupted is for brief 
periods, measured in tens of instructions, during which the device driver updates the control registers 
(especially the buffer pointers) of the interface. Cells arriving during these periods are absorbed by 
the switch buffers. 
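
A back-of-the-envelope calculation shows how much slack the switch buffer gives the driver. The Python
sketch below assumes standard 53-byte cells arriving back to back; the link rate passed in is a
hypothetical parameter, not a measured VuNet figure.

```python
CELL_BITS = 53 * 8  # a standard ATM cell is 53 bytes long


def cell_time_s(link_bps):
    """Time to receive one cell at the given link rate."""
    return CELL_BITS / link_bps


def max_stall_s(buffer_cells, link_bps):
    """Longest pause in cell transfer that the switch buffer can absorb
    when cells arrive back to back at the full link rate."""
    return buffer_cells * cell_time_s(link_bps)


# Illustrative only: a 256-cell buffer fed at an assumed 622.08 Mb/s.
stall = max_stall_s(256, 622.08e6)
print(f"{stall * 1e6:.1f} microseconds")
```

Under these assumptions the buffer covers a pause of roughly 174 microseconds, which is far longer than
a register update measured in tens of instructions.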


In retrospect, we are now more sensitive to the roles that various bits and pieces of memory play. 
The switch output buffers absorb bursts of cells from multiple sources and are also used to absorb 
cells while the host is performing DMA maintenance; the memory on the interface board (e.g. the 
FIFO of the TcIA chip) is used to decouple the timing of the host from the network; and the kernel 
buffers are used to hold large bursts while the host is processing the data. Ultimately, the caches 
feeding the processor registers decouple the timing and throughput of the processor from that of the 
memory subsystem. 


Figure 17: Model Connection Architecture 


5.4.3 Network Point of Attachment — Coprocessors and Temporal Decoupling 


A study of modern RISC architectures reveals a remarkable similarity in the organization of the var- 
ious modules of the memory and I/O system. To focus attention on the relatively under-emphasized 
question of the choice of the point of attachment, we have developed a model of the memory-I/O ar- 
chitecture of a workstation which allows us to identify potential “plug-points” for network interfaces. 
This “Model Connection Architecture” is shown in Figure 17. The figure reflects the functional re- 
lationship of the various modules of the memory hierarchy, and does not necessarily describe the 
physical placement of the various modules. 


In Figure 17, we see the different modules that a network adaptor can emulate, to make the network 
appear to reside at the different levels of the architecture. Thus, the network can be made to appear 
to reside in an I/O device of the workstation, in main memory, in cache*, or in registers. Different 
issues arise when the network adaptor emulates each level. 


Within Aurora, the I/O and register (coprocessor) approaches were investigated. Other projects 
such as Alewife [2] have investigated pseudo-caches, and yet others have experimented with pseudo- 
memories. 


Although the coprocessor approach appears attractive, we learned that it is only suited to a limited 
range of environments, such as single task embedded applications. In these environments, the 
receiving or transmitting application is always active and the processor registers are available to 
perform timely cell processing. In effect, the processor is synchronized with the cell streams arriving 
and departing through the coprocessor interface. 


In a multi-tasking environment, applications run in timeslices during which bursts of data are 
generated and/or processed. During these bursts, the application may process data at rates faster 
than the network. Between timeslices, the network runs faster than the application. For these 
non-embedded applications, approaches such as DMA, that leverage the burst absorption properties 
of the memory subsystem, are more appropriate than a coprocessor. 


*Most systems have several levels of cache. 


The broader lesson that we have learned is the importance of temporal decoupling, i.e., allowing 
various system components to manipulate data at differing rates and granularities. Memory subsystems 
(including caches) have been carefully designed to absorb bursts and handle speed and granularity 
mismatches.† Therefore, it is our current belief that general purpose interface designs should not 
bypass the memory subsystem. 


5.5 VuSystem Applications - Implications for Network Traffic 


Although recent network and multi-media research has paid considerable attention to the support 
of long-lived streams of continuous traffic, we believe that the network traffic generated by media- 
intensive applications will be dominated by its available bit rate (ABR) components. Our chain of 
reasoning is as follows: 


• Actual multimedia capture traffic is not continuous, but instead is periodic bursts of cells. 
Data is captured at a low average rate, but transmitted in packets at the available bit rate. 
Furthermore, our closed-loop capture approach [1] allows us to adapt to the available bit rate 
during video capture. This type of feedback has long been used in video displays. 


• Intelligent applications might have radically different display traffic patterns. For example, a 
security monitoring program might only update the display when it recognizes suspect activity. 
In effect, much of our display traffic is of the X-windows variety. 


• Through increased buffering at capture and display, we relax the temporal constraints on data 
transfer, thereby allowing available bit rate surpluses to offset shortages. 


• For every bit transferred over the network for capture and display, there is a multiplier effect: 
many more bits are transferred in support of storage and media-intensive processing. This 
traffic is totally decoupled from the real world, since our file servers and media processing 
programs are decoupled from real-time capture and display phenomena, such as sensors and 
phosphor. 


The impact of the technology curve on the final two points is significant. Declining memory prices 
will lead to larger capture and display buffers. Similarly, exponential growth in processor and 
network capacity will yield increasingly intelligent applications that have larger multiplier effects. 
Although some continuous bit rate (CBR) and predictable variable bit rate (VBR) components may 
remain, they will be relatively insignificant. 
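
The buffering argument can be illustrated with a small Python sketch. It assumes a source that captures
data at a constant rate and a network whose available rate changes from one interval to the next; the
function name, the rates and the interval length are all hypothetical.

```python
def required_buffer(capture_rate, available_rates, interval_s=1.0):
    """Return (peak backlog, final backlog) for a source that produces
    capture_rate units per second and is drained at the per-interval
    rates listed in available_rates."""
    backlog = 0.0
    peak = 0.0
    for rate in available_rates:
        backlog += capture_rate * interval_s        # newly captured data
        backlog -= min(backlog, rate * interval_s)  # send what the network allows
        peak = max(peak, backlog)
    return peak, backlog


# Two intervals with no bandwidth are offset by a later surplus.
peak, final = required_buffer(10, [0, 0, 30, 10])
print(peak, final)
```

In this run a buffer sized for the peak backlog lets the later surplus clear the earlier shortage,
so the source never needs a guaranteed continuous rate.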


Table 11 categorizes the network usage for the capture and browsing components of the ViewStation 
applications we have developed. The table indicates the video data format, network traffic type, 
bits transferred, subsession duration, and subsession frequency for example application components. 
For application components that pass data in periodic packet-trains for long periods of time, the 
number of bits transferred is indicated per second. Otherwise, the number of total bits transferred 
is indicated. We define a subsession as a single transfer of a video segment. One program session 
could include many subsessions. For example, a Joke Browser session consists of a subsession for 
each joke replayed. 


Historically, as computer networks have increased in capacity, traffic has become burstier. Traffic 
patterns are bimodal, with the bulk of the data transferred in a relatively small number of large 
bursts. Over time, the size of the large bursts continues to grow, and we expect this pattern to 
continue. The video server of the future will closely resemble the file server of the present, but the 
units of transfer will be measured in gigabytes instead of kilobytes and megabytes. Consequently, 
future networks must support traffic patterns that are at least as bursty as those on present-day 
LANs, at speeds and burst sizes that are orders of magnitude larger. 


†The VudBoard design, in which we pour network data to/from memory, leverages the burst absorption 
property of the memory subsystem. 


Table 11: The network usage classes for some ViewStation application components. 


The presence of bursty traffic may have significant implications with respect to the distribution of 
memory buffers across the network, and the presence of short subsessions may affect the design 
of network signaling systems. If network traffic is fundamentally bursty even at large bit-rates, 
then high-speed networks must be designed to support higher peaks in load than if the traffic is 
less bursty. Except for backbone networks that support highly aggregated traffic, gigabit network 
designers should not plan for high utilization, but instead should accommodate highly variable loads, 
especially during peak busy hours. 


Section 2: ANA Group Activities 


1 Introduction 


The research of the Advanced Network Architecture group had several high level objectives. 


The first was to demonstrate that workstations, not just supercomputers, could be high performance 
components of a fast network. While supercomputers represent a valid target for experimentation, 
there are several reasons to concentrate on the workstation. The workstation architecture represents 
the form of computing that will have the major economic impact on the market. The supercomputer 
will represent, in any generation, that machine too expensive to produce in quantity. So 
experimentation that leads to advances in workstation performance is most directly relevant to 
advancing the state of the art in broadly applicable ways. 


The second objective was to show that in order to achieve high performance, it is not necessary 
to restrict the network to one technology, such as ATM or HIPPI. While much of the attention at 
gigabit rates has been focused on specific technology options, our experience with IP has taught 
us that no one technology is likely to dominate the marketplace: today we have competing high 
speed LAN technologies, specifically 100 Mb/s Ethernet and ATM, and we have ATM, SMDS, Frame 
Relay and IP as competing wide area transport offerings from the telecommunications industry. We 
saw no reason why the gigabit arena should be different. We thus proposed, as a core research 
goal, to recognize the issues of heterogeneity at high speeds, and to understand if there were critical 
issues that needed to be addressed. We desired to show that high performance can be achieved in 
heterogeneous networks with different sorts of technology bases combined within them, in the way 
the Internet is today built out of different sorts of technologies. 


It is interesting to see how expectations have changed since the first proposals for Aurora were 
drafted. At that time, many people wondered if IP and in particular TCP were suitable for use 
at gigabit speeds, and many people thought that gigabit communications implied supercomputers. 
During the time of the project, these perceptions were essentially reversed. It became clear that 
because most supercomputers had I/O architectures poorly suited to networking and processors 
targeted heavily towards numerical computation, advanced workstations could execute network pro- 
tocols as fast or faster than these larger machines. Further, most people now assume that once 
computers have sufficient processing power and memory bandwidth, IP and TCP will operate at 
gigabit rates. 


This change in perception caused us to alter the specific objectives of our research. For example, at 
the beginning of the project, we proposed to build a high speed router, both to further our research 
objectives and because it was not clear that people knew how to structure such a device. Later, it 
became obvious that the construction of this device would be a waste of time, because commercial 
products were proceeding toward the target naturally. 


2 Research Goals and Approach 


To address these top level concerns, we identified a number of specific research areas. We looked at 
several aspects of heterogeneous computing, including the performance aspects of interconnection. 
We looked at issues of resource management, in particular support for explicit Quality of Service 
(QoS) control and approaches to congestion control. We looked at alternative models of protocol 
modularity, and alternative models of link multiplexing. We looked, to a limited extent, at issues 
of scale in high speed systems. We participated in the testbed efforts in switch design. Finally, we 
constructed a number of experimental devices to evaluate our concepts. 


2.1 Heterogeneous Interworking 


We define heterogeneous network architectures as those which can flexibly incorporate many net- 
work and end-system technologies, and which can easily evolve over time as new technologies are 
developed. Within the Aurora project, our overall goal was to demonstrate that heterogeneous net- 
work architectures are effective and appropriate for communication at gigabit rates. We looked at 
existing paradigms for heterogeneous interconnect, in particular IP packet forwarding, and assessed 
the performance requirements of this task. We then mapped these performance requirements to 
possible architectures for a gigabit forwarding element (the gigabit successor to an IP router). 
Finally, we proposed a novel approach to organizing protocols, an approach we call Application Level 
Framing, or ALF. We hypothesized that ALF might have better characteristics as an approach for 
heterogeneous interconnect. We discuss these study areas in the sections below. 


In general, heterogeneity is closely aligned with architectural structure and modularity. Systems with 
modular organization and well-defined interfaces are more flexible and evolvable than those designed 
as a single, unified entity. However, these otherwise desirable principles may introduce significant 
performance limitations if not carefully executed. A goal of our work was to carefully differentiate 
between fundamental performance limits and those created as an artifact of poor structuring, and 
to identify new structuring abstractions appropriate for use in high-speed networks. 


As an aid to understanding the real problems of heterogeneous interconnect, we coded a high per- 
formance IP forwarder, in order to understand its performance implications, both fundamental and 
structural. We measured its performance, and used software tools to assess where the performance 
issues lay. 


It has been suggested that any use at all of operating system or similar modular abstractions inside 
the high-speed forwarding path of a router will introduce unmanageable overhead, and instead one 
must implement the forwarding software as stand-alone code or in hardware. To understand this 
issue, we explored one operating system, the x-kernel [12], which is specialized for the implementation 
of network protocols, but at the same time attempts to provide an abstract interface to key system 
services such as buffer management. 


2.2 QoS 


An overarching research objective of our group has been the development of models and mechanisms 
for the support for Quality of Service. This term is generally taken to mean the capability of a 
network to offer multiple, controlled levels of service, and of an application to request from the 
network service which explicitly offers the needed values of key performance parameters such as 
bandwidth, delay, and loss rate. Our core work in this area includes the study of application 
requirements vis a vis traditional network service models, the development of new service models that 
the network can present to the application, the development, simulation and implementation of new 
algorithms which implement these service models through queue management and packet dropping, 
and the development of related protocols for setting up resource reservations and for admission control. 


Our work in this area is broadly based, combining activities within the Aurora context, collaborative 
efforts within the ARPA-funded DARTnet testbed, and within our local research group. This broad 
perspective has enabled us to identify and focus on solutions applicable to a wide range of networking 
environments. Our work has led to proposed standards for providing this capability in the IP protocol 
family, and in part to the current approach to available bit rate service in the ATM Forum. 


Within Aurora, our objectives have been the following. First, we wished to determine if any fun- 
damental, as opposed to implementation, issues arose from the high speeds involved. Second, since 
ATM is a key technology for emerging broadband networks, we wished to apply our overall results 
specifically to the ATM context. 


A key issue arose with respect to ATM. While ATM carried the promise of sophisticated QoS 
support, the standards at the time did not realize this capability. Rather, they defined an interface to 
the capability from higher layers, without giving any guidance as to how ATM network elements 
might actually support it. This implied the need to develop a framework for management of 
bandwidth within ATM. We approached this by adapting our more general framework to the specific 
requirements of ATM. We then studied our adaptation through simulation, and through the use of 
a programmable output port controller board for the Bellcore Sunshine ATM switch, which allowed 
us to demonstrate cell-level traffic management in the ATM portion of the Aurora network. 


High speeds imply that the data forwarding path in the router or switch must process a large number 
of packets or cells per second; at 622 Mb/s the cell rate for an ATM link is over one million cells per 
second. This implies that any cell-level processing must be simple and efficiently implementable. 
We thus undertook the specific project of tuning our basic QoS control algorithms, which support 
a rather sophisticated model for allocation of network bandwidth, for the specific case of ATM 
forwarding. Our approach was to evaluate the detailed performance characteristics of the algorithm, 
and to propose approximations to the exact algorithm that may be more efficiently implemented. 
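
The cell rate quoted above follows directly from the 53-byte cell size. The short Python calculation
below reproduces it and derives a per-cell processing budget. It ignores SONET framing overhead, and
the 33 MHz clock is only an illustrative assumption for a forwarding processor.

```python
CELL_BYTES = 53  # standard ATM cell size


def cells_per_second(link_bps):
    """Cell arrival rate on a fully loaded link (framing overhead ignored)."""
    return link_bps / (CELL_BYTES * 8)


def cycles_per_cell(link_bps, clock_hz):
    """Clock cycles available to process each cell at the given clock rate."""
    return clock_hz / cells_per_second(link_bps)


rate = cells_per_second(622e6)
budget = cycles_per_cell(622e6, 33e6)
print(f"{rate:,.0f} cells/s, about {budget:.1f} cycles per cell")
```

With only a few tens of cycles per cell under these assumptions, an exact scheduling computation
leaves little headroom, which motivates the cheaper approximations.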


2.3 Congestion Control 


A topic closely related to support for QoS is the problem of congestion control at high speeds, 
especially in ATM networks. Two general approaches to this problem are available: hop-by-hop 
control and end-to-end control. Our work focused on the end-to-end approach, which is used in the 
Internet today. 


Congestion control in the IP protocol family is handled primarily by TCP.¹ TCP contains an 
algorithm for rate adaptation, based on packet loss as a feedback signal. Early experiments with TCP 
over ATM, performed at a number of sites, demonstrated that improper design of the output buffers 
in the switch was causing cell losses, which interacted with the TCP algorithms to cause very poor 
performance. For this reason, we studied the congestion control and avoidance algorithms for TCP, 
which might be implemented in the host running the TCP, or perhaps even in the network. Our 
approach was based on simulation using highly detailed models of a current best-practice TCP and 
of the Sunshine ATM switch. 


Current congestion control algorithms in TCP hunt for the correct sending rate, increasing their 
sending rate until they encounter an indication of congestion (a lost packet) and then backing off, 
after which they begin again to increase their rate. The current backoff algorithm can reduce the 
achieved sending rate to one packet each round trip, a reduction so severe that it can 
prevent effective use of any high speed link. We thus explored a number of modifications to the 
current implementation practice for TCP, with the goal of improving its effective performance over 
links of high delay and bandwidth. These were evaluated using our network simulator and real tests 
over portions of the testbed and the Internet. 
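
The hunting behavior can be seen in a deliberately simplified Python model. It is not the simulator
used in this work: it omits slow start, timers and retransmission, treats any window above the path
capacity as a loss, and compares a collapse to one packet with a halving backoff chosen purely for
contrast. All names and parameters are hypothetical.

```python
def simulate_window(capacity_pkts, rounds, collapse_to_one=True):
    """Return the sending window (packets per round trip) for each round."""
    window = 1
    history = []
    for _ in range(rounds):
        history.append(window)
        if window > capacity_pkts:   # loss: the path cannot carry this much
            window = 1 if collapse_to_one else max(1, window // 2)
        else:
            window += 1              # probe upward by one packet per round trip
    return history


def utilization(history, capacity_pkts):
    """Fraction of path capacity used over the simulated rounds."""
    delivered = sum(min(w, capacity_pkts) for w in history)
    return delivered / (capacity_pkts * len(history))


print(simulate_window(4, 8))                         # collapse to one packet
print(simulate_window(4, 8, collapse_to_one=False))  # halve instead
```

In this model the larger the path capacity in packets per round trip, the more rounds are spent
climbing back after each collapse, which is why the effect is most damaging on links of high delay and
bandwidth.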


2.4 ALF 


Application Level Framing, or ALF, is a proposal to restructure the modularity of network protocols. 
There are a number of objectives for this remodularization, but the one relevant to the gigabit 
environment was the hypothesis that the breaking of the data into transmission units would be more 
efficient with ALF than with traditional packet switching. In particular, for ATM, the application 
data unit, which might be substantially larger than a packet, could be directly broken into cells 
rather than being first broken into packets. 


¹This is becoming more and more unacceptable as UDP-based real-time data traffic grows to be a significant 
percentage of total Internet traffic. 


These architectural ideas were first proposed by us in 1990 [4]. We hypothesized a number of 
advantages, one of which related to speed. We thought it might be possible to build a faster router 
forwarding ALF elements than forwarding IP packets. The reason for this is that the ALF approach 
appeared to reduce the number of routing and scheduling decisions needed to achieve the same data 
throughput rate. 
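
One way to picture the hypothesized saving is to count forwarding decisions per application data unit.
The Python sketch below assumes, purely for illustration, one routing and scheduling decision per
forwarded unit; the function name and the sizes used are hypothetical and are not taken from the ALF
specification.

```python
import math


def forwarding_decisions(adu_bytes, packet_payload_bytes=None):
    """Count per-unit forwarding decisions needed to move one ADU.

    With packet_payload_bytes=None the ADU is forwarded as a single unit;
    otherwise it is first split into packets of the given payload size.
    """
    if packet_payload_bytes is None:
        return 1
    return math.ceil(adu_bytes / packet_payload_bytes)


adu = 64 * 1024  # a hypothetical 64 KB application data unit
print(forwarding_decisions(adu))        # forwarded whole
print(forwarding_decisions(adu, 1480))  # split into 1480-byte payloads first
```

Under this assumption the number of decisions per byte falls as the application data unit grows, which
is the effect the hypothesis relies on.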


To better understand the real issues in this approach we reduced the conceptual framework of 
ALF to a specific technical proposal, which included details such as header formats and forwarding 
algorithms. 


To validate this specification, we coded the core of ALF, including the operating system support, 
in the Unix operating system. The first implementation was not targeted towards an aggressive 
performance target, but allowed us to understand the issues of state maintenance, the structure of 
demultiplexing and so on. 


Our experience with this implementation allowed us to construct a performance profile for the ALF 
forwarding process, which we then mapped onto the hardware requirements for a high performance 
forwarder. 


2.5 Alternatives for Link Multiplexing 


A key issue for ALF is the problem of access latency at the switch. The transmission of a large 
application data unit over a slow link might delay the sending of a subsequent unit beyond acceptable 
limits, especially in the case of data flows requiring bounded real-time delivery QoS. It is thus 
necessary to break the ADU up into link-layer multiplexing elements, such as packets or cells. We 
explored an alternative, transmission of integral ADUs, but with pre-emption of the transmission if 
a higher priority data unit arrives. This approach permits us to build a system with the transmission 
latency of ATM cells, but which has none of the disadvantages of the small cells. We explored this 
approach as a way of better understanding the intrinsic issues in interleaving traffic with different 
QoS requirements and different data unit sizes. We proposed a specific switch design based on 
pre-emption, and performed detailed simulations on this design. By detailed simulation of links and 
switches, we uncovered a number of interesting issues, including a data clumping phenomenon that 
arises when small and large data units are mixed. 
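
The access latency problem is easy to quantify. The Python sketch below computes the worst-case wait
of a newly arrived high priority unit behind a unit already being transmitted, either when that unit
must finish or when transmission can be interrupted at some granule. The link rate, the ADU size and
the 53-byte granule are illustrative assumptions, not parameters of the switch design we simulated.

```python
def worst_case_wait_s(adu_bytes, link_bps, granule_bytes=None):
    """Worst-case time a high priority arrival waits for the link.

    granule_bytes=None models integral, non-pre-emptive ADU transmission;
    otherwise transmission can be interrupted every granule_bytes.
    """
    blocking = adu_bytes if granule_bytes is None else min(adu_bytes, granule_bytes)
    return blocking * 8 / link_bps


slow_link = 1.5e6  # hypothetical slow link, bits per second
print(worst_case_wait_s(64 * 1024, slow_link))      # wait for the whole ADU
print(worst_case_wait_s(64 * 1024, slow_link, 53))  # interruptible at cell granularity
```

With these numbers the wait drops from about a third of a second to under a millisecond, while the
large unit is still handled as a single ADU rather than as independent cells.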


2.6 Large Scale Distributed Systems 


While the major focus of the testbed was on speed, we explored issues of scale and decentralization. 
We discussed with Bellcore the design of the signaling system proposed for the Sunshine switches, 
called Expanse. We installed the package from Bellcore called Touring Machine, both to evaluate 
the possibility of using it as an application platform for Aurora, and to understand its basic control 
framework. 


2.7 Switch Design 


While our top-level interests are architectural, architecture must be validated by an understanding 
of the technology issues that arise in reducing architecture to practice. We participated in the design 
discussions for the Sunshine switch, both to assist in assuring that the switch would be suitable for 
the range of intended experiments and to learn about the interplay of design issues in developing 
the key network components. 


We also developed, as a part of the MIT network simulator, a component modeling the Sunshine 
switch, that could form the basis of detailed comparisons between the simulated and real environ- 
ment. 


2.8 Prototyping Platforms 


A core component of our experimental program is the development of forwarding and end-node 
protocol software, both to understand the structural issues in reducing architectures to practice 
and to understand where the performance bottlenecks lie. This objective required the selection and 
development of a hardware and software environment for these experiments. 


The focus of the testbed generally was on workstations and high-end “desktops” rather than on 
supercomputers. While we understood that workstations might have intrinsic performance limits 
that would preclude operation at gigabit rates, we felt that workstations were more representative 
of the eventual context for high performance networks, that workstations presented a more obvious 
and less convoluted memory and I/O architecture, and that workstations were likely to offer a more 
open and malleable programming environment than a supercomputer. 


For this reason, our effort in developing a software testbed environment focused on the issues of 
operating systems and network interfaces to machines of the workstation class. 


3. Experiments and Results 


3.1 Heterogeneous Network 


We implemented a fully functional IP packet forwarder within the x-kernel protocol implementation
framework. We then analyzed the performance of this implementation and developed several
modifications to the x-kernel framework which removed artificial performance limitations. Our goal
in this experiment was to understand and measure the ratio of required functionality to overhead
in this implementation, and by extension in other well-tuned protocol implementation frameworks.
Understanding this ratio offers an approach to determining whether high-performance routing
elements can be implemented within a framework, or must be implemented completely “from scratch”
on raw hardware.


The implementation of IP in the x-kernel was first tested in the context of a null network interface,
an interface that required no instructions and thus imposed no overhead on the processing. Running
on a MIPS R3000 processor (a 33 MHz machine), the code was capable of forwarding approximately
30,000 packets a second, and faster RISC processors continue to yield faster forwarding rates. The
instructions executed were about equally divided between the IP forwarding code itself and the
x-kernel. This level of performance is consistent with what other careful implementations of IP
have demonstrated on RISC processors. The major overhead of the x-kernel is the creation of a
thread to handle the forwarding of each packet. However, the consequence of this approach to
kernel organization, a factor of two in performance as the price of running the x-kernel, is probably
not acceptable for any context except an experimental platform.


We next proposed and implemented some extensions to the x-kernel to reduce the overhead of thread 
and message buffer management. These modifications allowed us to increase the packet forwarding
rate of our experimental system to approximately 55,000 packets per second. After these modifications,
approximately 60 percent of the CPU cycles were used to perform core forwarding requirements, 
while 40 percent reflected overhead. The additional increase in performance is due to improved 
instruction scheduling and memory cache utilization. 


These experiments gave an upper bound on the IP forwarding performance of our experimental 
hardware platform’s CPU and memory subsystems. Note that they intentionally elided the cost of 
managing hardware interfaces. 


We then extended our forwarding code to drive Ethernet (AMD “Lance” controller), ATM (Bellcore
“Osiris” interface constructed for Aurora) and MIT’s FLORB interface. Detailed performance
results for the experimental interfaces are given below. With our implementation of the Ethernet
interface driver as the only hardware interface, the experimental forwarder operated at approximately
3000 packets per second. This number is lower than the figure suggested by the raw capabilities of
the hardware. We determined that the packet rate was limited by the workstation’s I/O subsystem
and bus arbitration scheme.


The core costs of processing a packet at the IP level are minimal. While it is difficult to extrapolate 
from specific implementations to a general conclusion, some rough estimates are helpful in getting a 
perspective on the requirements. Our code audits suggest that IP forwarding can be coded in from 
50 to 200 RISC instructions. The important processing steps, with costs in our context, are the 
following. 


• Check header fields for correctness: 15 inst.
• Compute header checksum: 15 inst.
• Decrement TTL field: 5 inst.
• Look up route in cache: 25 inst.
• Add local network header to packet: 20 inst.


The steps listed above total 80 instructions. One in three of these instructions touches memory,
either the packet or the forwarding data structures. If we assume one cycle per instruction plus
one extra cycle for each memory reference, the total cycle count to process a packet is about 107.
A 100 MHz processor could thus process IP headers at a little less than 1,000,000 packets per
second. At this rate, the minimum packet size that would allow a 622 Mb/s link to be fully
utilized is 83 bytes. Based on this sort of analysis, it is reasonable to conclude that a practical
gigabit IP forwarder can be constructed using a software forwarding engine based on a
general-purpose RISC CPU.
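
The arithmetic behind this estimate can be checked with a short calculation. The Python sketch
below is illustrative only: the function and variable names are ours, and the only inputs are the
per-step instruction counts, the one-in-three memory reference ratio, the 100 MHz clock and the
622 Mb/s link rate quoted in the text.

```python
# Per-step instruction counts for IP forwarding, as listed in the text.
STEP_INSTRUCTIONS = {
    "check header fields": 15,
    "compute header checksum": 15,
    "decrement TTL": 5,
    "route cache lookup": 25,
    "add local network header": 20,
}


def cycles_per_packet(steps, memory_fraction=1 / 3):
    """One cycle per instruction, plus one extra cycle per memory reference."""
    instructions = sum(steps.values())
    return instructions * (1 + memory_fraction)


def packets_per_second(clock_hz, cycles):
    """Header-processing rate of a CPU that spends `cycles` on each packet."""
    return clock_hz / cycles


def min_packet_bytes(link_bps, pps):
    """Smallest packet size that still fills the link at `pps` packets/s."""
    return link_bps / pps / 8


cycles = round(cycles_per_packet(STEP_INSTRUCTIONS))  # 80 instructions -> 107 cycles
pps = packets_per_second(100e6, cycles)               # a little under 1,000,000
size = min_packet_bytes(622e6, pps)                   # roughly 83 bytes

print(cycles, round(pps), round(size, 1))
```

Changing the clock rate or the memory reference ratio shows how sensitive the minimum packet size
is to those two assumptions.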


Our assessment of the processing overheads of the IP and ALF forwarder (see below) led to the 
conclusion that the burden of the IP processing was not the dominant cost of forwarding. In our 
initial experiments, described above, the cost of managing hardware was substantially greater than 
the processing cost at the IP level, as shown by the greatly reduced packet rate when the hardware 
interface was added to the system. Analysis of the code required to drive other common commercial 
network interfaces suggests that this is a generally accurate conclusion. 


Even with specialized operating systems and device drivers, we found that the cost of the device 
drivers for traditionally designed interfaces dominates the processing costs at the IP level. Adding 
to the cost of the device driver the cost of queue management for QoS scheduling, we conclude that 
the overhead of the device driver is the key to high-performance processing. One does more to speed 
up packet processing by changing the way one controls the I/O interface than by changing IP. 


3.2 High Performance Flow Forwarder 


On the basis of the above analysis, we proposed a design for a high performance packet or flow 
forwarder. While the time-frame of the project did not permit us to implement this device, we 
believe that the original design approach, first proposed in June of 1991, remains correct. This 
design, which is illustrated in Figure 1, involves a processor to manage each device, a processor to 
execute the forwarding code for each input path, and a processor to perform the background control 
tasks such as route computation and network management services. We estimated that given the 
processing rates of advanced RISC processors, this architecture could support reasonable packet 
processing rates for gigabit links. 


In addition to the packet processing limits, a forwarder must have sufficient memory and bus
bandwidth to pass the data at gigabit rates, and sufficient decoupling between the modules outlined
above to allow the modules to function in parallel. To deal with the data rates implied by this
architecture, we proposed two designs. In one, there is a buffer at the middle of the data path, with
a data moving element attaching each network device. In the other design, especially suited to a
two-port device, there is an efficient data mover at the center of the data path, which cross-connects
the buffers on each side. This latter design seems to offer some advantages, and was used in the
development of the FLORB interface, discussed below.


Figure 1: Block Diagram of High-Performance Flow Forwarder


3.3 Memory Architecture of Workstations


We performed a number of tests on a (then) state of the art workstation (Digital Equipment Corp.
Model 3000/400) to characterize latency and throughput across the I/O interface, the memory
bus, and the CPU caches. Our intention was to understand the workstation’s performance well
enough that we could later relate network performance to these fundamental properties of the
hardware implementation. Our goal was to differentiate between limits imposed by the hardware
and limits imposed by poor structuring of software, and to be able to determine how fast a particular
network-intensive application, protocol or driver should run.


The rather obvious model holds. When data arrives from the network, it is placed in memory.
Then the application picks it up. There will thus be a minimum of two memory cycles as a part of
receiving data. (Sending is the same.) Protocol overhead will add to this count of memory cycles.
So to attach (usefully) a gigabit network to a host, we need a host with a memory bandwidth
of 2 gigabits/second. Very few workstation-class computers have that memory bandwidth even
today, and thus end-to-end gigabit applications were not literally achievable in the time frame of
this research. In time, these speeds will naturally come. What we showed is that one can push
up against this limit in practice. One actually can build implementations that run at the speed of
memory. For traditional workstations (some parallel machines and supercomputers are special
cases), since memory is the expensive component and processors seem to get faster more quickly
than memory does, we do not anticipate a sudden protocol processing bottleneck. Memory bandwidth
to move the data will continue to be the fundamental throughput limit.
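
The memory model above reduces to a one-line relation between link rate and memory bandwidth.
The following Python sketch is only an illustration of that relation; the function names and the
`memory_passes` parameter are our own, with the default of two passes taken from the text (one
write by the network interface, one read by the application).

```python
def required_memory_bandwidth(link_bps, memory_passes=2):
    """Memory bandwidth (bits/s) needed to keep a link of `link_bps` busy."""
    return link_bps * memory_passes


def sustainable_link_rate(memory_bps, memory_passes=2):
    """Highest link rate (bits/s) a host with `memory_bps` of memory bandwidth can use."""
    return memory_bps / memory_passes


# A gigabit link with the minimum of two memory passes needs 2 Gb/s of memory bandwidth.
print(required_memory_bandwidth(1e9))
# Each additional pass added by protocol processing raises the requirement by another 1 Gb/s.
print(required_memory_bandwidth(1e9, memory_passes=3))
```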


3.4 QoS


A major thrust of our research into QoS management was to investigate theoretical and imple- 
mentation issues related to resource management and control algorithms for an Integrated Services 
Packet Switch (ISPS). Our Aurora objective in this area was to show that the processing necessary 
to enforce QoS scheduling in switches could be done at high speed. To accomplish this, we developed 
algorithms that completely realized our proposed model, characterized their performance, and then
developed approximations that were more efficient to execute but still close in functionality. We
tested these in simulation and in the Sunshine switch to verify their performance and correctness in 
various situations. 


The ATM framework, which includes explicit circuit setup, can in principle support a very sophis- 
ticated range of QoS management capabilities. However, in practice the first products and the first 
standards depended only on admission controls at connection setup time, together with a very simple 
queue discipline such as first come, first served. Early simulations performed at Bellcore suggested 
that in real-world conditions (traffic with realistic levels of burstiness), this simple approach would
support only very low levels of link utilization before failing. We thus concluded that we should 
develop and propose a cell queuing scheme to improve link loading in ATM networks. 


Working with Bellcore, we determined that we would propose to the ATM Forum as a first scheme 
a very simple weighted fair queuing (WFQ) mechanism. After some discussion, we found that the 
overhead and complexity of even this simple scheme was a major source of concern to implementors. 
With our collaborators at Bellcore we responded to this concern in a number of ways, including the 
development of the high-efficiency approximate HWFQ algorithm described below, and development 
of the second generation Sunshine Output Controller Card. Given the concern with the performance 
of a simple WFQ scheme, we recognized that the complexity of our full scheme, which required a 
number of queues, would be even more problematic. Oumar Ndiaye and Christopher Lefelhocz 
investigated one solution to remove the complexity. 


As part of his master’s work [17], Ndiaye laid out the groundwork of the link sharing algorithm, 
developed a more efficient algorithm, and simulated the new algorithm. We briefly mention each 
of these areas below. In our full model for link bandwidth allocation, the bandwidth a customer 
requires can be thought of as a service. Since a customer may also be a provider of service, the 
service can be broken down into several sub-services. A service that is broken into sub-services is 
called a service class. A service that is not broken into sub-services is called a real service. A tree
structure that shows the sharing of a link’s bandwidth among customers can be generated. This 
tree structure has service classes as internal nodes in the tree and real services as leaf nodes in the 
tree. 


Ndiaye identified two goals to the sharing of a link’s bandwidth:


• At any active period of the link (when there are cells to transmit), each active customer is
guaranteed a share of the link no less than the portion of the link bandwidth it owns.


• For any sub-service, the amount of link resource at its disposal that it is not using is distributed
to all its sibling sub-services that need more link usage than the amount they are entitled to.
This distribution is done proportionally to the rates of the sub-services.


To support these properties, our original link sharing algorithm used a Hierarchical Weighted Fair
Queuing (HWFQ) mechanism. Weighted Fair Queuing is a scheduling discipline that classifies each
incoming packet into a service class, and then maintains a sorted list of classes, based on a computed
departure time for that class given its service allocation. Hierarchical WFQ consists of a WFQ
mechanism at each non-leaf node of the tree. Each leaf node contains a queue of cells to be sent.
For each non-leaf node, weights are assigned in proportion to the amount of service requested by
each child. The entire mechanism can be thought of as a WFQ of WFQ’s. This mechanism was
found to be accurate in preserving the properties stated above. However, the complexity was large,
since the number of different WFQ’s involved in the sending/receiving of a cell was proportional to
the depth of the tree. We performed a study to determine where the costs were in our algorithm,
by actual audit of path lengths, and determined that the major cost was entering the data structure
representing a cell’s service class into the ordered list representing the WFQ at each level during the
queuing stage. This complexity is particularly troublesome in an ATM network, where the number
of cells to send/receive is large.


Bellcore had developed a sequencer chip [3] which performed WFQ sequencing in hardware, but 
that chip only implemented one ordered queue, while our full algorithm depended on a hierarchy of
queues. This complexity made it impossible to implement our scheme directly using that chip.


We concluded that a flat, single-level queue structure, in addition to matching the particulars of 
the Bellcore sequencer chip, was likely to lead to efficient implementation in a number of cases. For
this reason, we sought an algorithmic approximation to our hierarchy of queues that depended on a 
single queue. 


Ndiaye proposed a way to approximate the behavior of the HWFQ by mapping the weights of the
real services into a single flat WFQ. To preserve the second property above (redistribution of
unused bandwidth among sibling sub-services), a separate algorithm was run periodically which
determined which classes in the HWFQ were active and recalculated the mapping from service
classes to real services in the single WFQ. Whether a real service was active was based on
whether it had cells waiting to be sent.
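
One way to picture the periodic mapping is as two passes over the sharing tree: a post-order pass
that marks which nodes are active, followed by a pre-order pass that divides each active node's
share among its active children in proportion to their weights. The Python sketch below is our own
minimal reading of that idea, not the code from Ndiaye's thesis; the `Node` class, its field names
and the example tree are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    weight: float                      # share requested relative to siblings
    children: list = field(default_factory=list)
    backlogged: bool = False           # leaves only: cells waiting to be sent
    active: bool = False


def mark_active(node):
    """Post-order pass: a class is active if any real service below it is."""
    if not node.children:
        node.active = node.backlogged
    else:
        # Build the full list first so that every child gets marked.
        flags = [mark_active(child) for child in node.children]
        node.active = any(flags)
    return node.active


def map_weights(node, share=1.0, out=None):
    """Pre-order pass: split each node's share among its active children."""
    if out is None:
        out = {}
    if not node.children:
        out[node.name] = share if node.active else 0.0
        return out
    total = sum(child.weight for child in node.children if child.active)
    for child in node.children:
        child_share = share * child.weight / total if child.active else 0.0
        map_weights(child, child_share, out)
    return out


# Hypothetical tree: class A (weight 5) has real services a1 (3) and a2 (2);
# real service B (weight 5) hangs directly off the link.
a1 = Node("a1", 3, backlogged=True)
a2 = Node("a2", 2, backlogged=False)
b = Node("B", 5, backlogged=True)
link = Node("link", 1, children=[Node("A", 5, children=[a1, a2]), b])

mark_active(link)
flat = map_weights(link)
print(flat)  # a2 is idle, so its share goes to its sibling a1 rather than to B
```

The resulting dictionary is the set of weights that would be loaded into the single flat WFQ until
the next update period.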


This reduced the complexity of the send/receive algorithm. Given that the sorting portion of
the WFQ mechanism was implemented in hardware, the receiving of a cell was estimated to take 
55 instructions and the sending of a cell to take 70 instructions. These were estimates since the 
algorithm was tested through simulation rather than on a physical output port of a switch. 


In simulating the new algorithm, Ndiaye varied the period of update of the mapped weights. It was
found that for smaller periods, the results were almost the same as those of the original algorithm.
For larger periods, the results were a good approximation to the original algorithm. Thus it is
believed that for larger sharing trees, where the updating algorithm takes longer, the performance of
the simpler algorithm will be a good approximation to the performance of the complex algorithm.


While Ndiaye’s work was being completed, an ATM Output Port Controller (OPC) was being 
developed for the Bellcore Sunshine switch, which contained the features necessary to test out the 
complexity of the algorithm. Lefelhocz took the algorithm developed by Ndiaye and implemented it 
on the new output port controller. The intent was to learn more about the coding of the algorithm 
while showing that such an algorithm was fit for use in an ATM environment. Also it was discovered 
that the updating algorithm could be done in a simple pipelined manner. These results are described 
below. 


The processor used on the OPC for implementation of our proposed algorithm was an Intel 960CA 
33 MHz microprocessor. Given the bandwidth of the output port, the number of instructions which 
can be executed in one cell time is approximately 80. Simplifying Ndiaye’s algorithm to the extreme,
it was found that C code compiled into assembly took 32 instructions to receive a cell and 46
instructions to send a cell. Since during periods of congestion, two cells can arrive at the OPC
during the time one cell leaves, the receive algorithm is run twice in the worst case. Thus, the
processing cycle would take a total of 110 instructions to complete. Simple analysis of the compiled
assembly revealed 7 instructions could be removed from the receive algorithm and 6 instructions
from the send algorithm. This resulted in an instruction count of 88, only slightly above
the ceiling of 80. It is believed that with only slightly faster processors, this algorithm could run
at full rate.


It was hoped that the same microprocessor could be used for both the send/receive algorithm and 
for the background task to recompute the weights based on the current activity of the classes. The 
speed of the processor used did not permit this. However, two options exist for making the weight 
recomputation algorithm run as a “background process”. The first is to add another microprocessor. 
The mapping algorithm can run independently of the send/receive algorithm. The second is to use 
a faster microprocessor. Since the worst-case instruction count only slightly exceeds what the
current processor can execute in one cell time, a processor which is twice as fast would be capable
of running both the mapping and send/receive algorithms while meeting the cell time requirement.


Finally, in the process of developing the mapping algorithm for the i960, Lefelhocz found that the
weight recomputation algorithm proposed by Ndiaye could be improved so that it computes a new
set of weights for each tree traversal. The original proposal required two distinct phases to check
for activity and to recalculate the rates. The former phase is a post-order tree traversal. The latter
phase is a pre-order tree traversal. The two phases cannot be combined. However, the phases can
be pipelined such that the first rate mapping occurs after the second tree traversal, the second
rate mapping occurs after the third tree traversal, etc. In this way, the algorithm is pipelined to
constantly update the mapped weights of real services.


3.5 Congestion Control 


The current algorithms for TCP congestion control are rather complex, and depend on specific 
details of the design. Prior experiments in our simulator have been limited by the ability of the 
simulator’s TCP component to represent all these details. To remedy this limitation, we began our 
work on TCP congestion control by adding to our simulator a very detailed representation of the 
TCP congestion control mechanisms. Our approach was to modify our simulator so that we could 
incorporate into it code directly derived from a real TCP implementation, so that the behavior 
of the simulator and the TCP in the real network were based on the same implementation. This 
eliminated any concerns as to whether the simulation component was a proper representation of the 
real world. 


Based on this code, we explored a number of issues in the operation of TCP, to understand better how
it might be tuned to perform better over long-delay, high-bandwidth circuits.


We identified two major effects that limit performance. First, the so-called long timeout event, in 
which a lost packet is detected only when a retransmission timer at the source expires, represents a 
degradation of performance whose impact becomes much more important as the speed of the link 
increases. Thus, it becomes increasingly important to avoid this class of event by finding alternative 
techniques for detecting packet losses. Second, reducing the window size to one after a loss event 
slows the throughput to such a degree that reasonable throughputs are not regained for much too 
long a period of time. Thus, a less conservative backoff algorithm should be employed when it can 
be justified. 


An MS thesis by Janey Hoe [9] proposed algorithms to achieve these goals. To respond to a higher
percentage of lost packets without invoking retransmission timeouts, she proposed an extension to
the currently used “fast retransmit” scheme. In the current scheme, the sender concludes that there
has been a lost packet when three duplicate acknowledgments have been received, since duplicate
ACKs are sent by the receiver when arriving packets do not come in a continuous sequence. Her
modification was to extend this idea to incorporate a fast recovery mode. Fast recovery mode is
initiated whenever a fast retransmit trigger event (three duplicate ACKs) occurs. Once fast recovery
mode is initiated, additional packets are retransmitted whenever an acknowledgment is received
that acknowledges fewer than the known number of outstanding packets. This modification is very
effective in distinguishing between lost packets and other sequences of ACKs that reflect normal
operation, and allows rapid recovery from an instance of multiple packets being lost within a single
round-trip time, a circumstance significantly more common in networks with a high bandwidth-delay
product.
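
The sender-side logic described above can be sketched in a few lines. The Python class below is a
simplified illustration under our own assumptions, not code from Hoe's thesis or from any TCP
implementation: sequence numbers count whole packets, ACKs are cumulative (each carries the next
packet number the receiver expects), and window management is omitted. The class and attribute
names are hypothetical.

```python
class FastRecoverySender:
    """Toy sender that tracks duplicate ACKs and partial ACKs during recovery."""

    DUP_ACK_THRESHOLD = 3

    def __init__(self):
        self.highest_acked = 0       # next packet number the receiver expects
        self.next_to_send = 0        # first packet number not yet sent
        self.dup_acks = 0
        self.recovery_point = None   # value of next_to_send when recovery began
        self.retransmitted = []      # packet numbers resent without a timeout

    def send(self, count):
        """Record that `count` new packets have been transmitted."""
        self.next_to_send += count

    def on_ack(self, ack):
        outstanding = self.highest_acked < self.next_to_send
        if ack == self.highest_acked and outstanding:
            # Duplicate ACK: the receiver is seeing packets out of sequence.
            self.dup_acks += 1
            if self.dup_acks == self.DUP_ACK_THRESHOLD and self.recovery_point is None:
                self.recovery_point = self.next_to_send
                self.retransmitted.append(ack)      # fast retransmit
        elif ack > self.highest_acked:
            self.highest_acked = ack
            self.dup_acks = 0
            if self.recovery_point is not None:
                if ack < self.recovery_point:
                    # Acknowledges fewer packets than were outstanding, so
                    # another packet from the same window was lost.
                    self.retransmitted.append(ack)
                else:
                    self.recovery_point = None      # window fully acknowledged


# Packets 0..9 are sent; packets 2 and 5 are lost in the same round trip.
sender = FastRecoverySender()
sender.send(10)
for ack in [1, 2, 2, 2, 2, 2, 2, 2, 5, 10]:
    sender.on_ack(ack)
print(sender.retransmitted)
```

In this trace both losses are repaired without a retransmission timeout: the third duplicate ACK
triggers the resend of packet 2, and the later ACK for packet 5 (short of the ten packets
outstanding) triggers the resend of packet 5.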


Hoe’s second proposal involves allowing a larger window size after a loss event, but explicitly limiting 
the size of a burst of packets that a sender can emit at any one time. By limiting the burst size, 
it seems possible to be more liberal in setting the window size after a congestion slowdown. This 
allows the TCP to keep the pipelined flow of data active, while still providing protection against 
repeatedly triggering episodes of congestion. 


3.6 ALF 


Our assessment of the code implementing the ALF forwarder and end-node is that there is a slight 
increase in overhead in processing ALF in a forwarder. This increase is offset by a slight simplification 
of the processing in an end-node. The increase in the forwarder is caused by the fact that while the 
forwarder must now deal with forwarding at the ADU level rather than the cell or packet level, it
is still necessary to recognize and deal with the transmission multiplexing unit to some extent. So
there is some aspect of packet processing together with the higher level ADU processing. 


In exchange for this increase in cost, there are substantial simplifications in the processing at the end- 
node. The benefit is not a reduction in code path lengths, but in scheduling and system overhead. 
In the case of ATM networks, for example, there is no intermediate processing of packets, which will 
reduce the interrupt rate and process scheduling rate of the system. More importantly, the ALF 
framework defers virtually all processing of messages until the application can actually act on the 
data present. This allows the operating system scheduler to efficiently manage processes which are 
using the network, and eliminates the unfortunate effects of “layered multiplexing.” 


ALF and ILP were first proposed in the context of a specific alternative architecture to the IP and 
TCP architecture. As the ideas became more widely understood, they instead served as a model 
to reason about designing and coding protocols and applications in the context of IP. An example 
is the work on Light Weight Sessions [8]. At this time, ALF is best thought of as a new model for 
thinking about how to use existing protocols and mechanisms. 


3.7 Alternatives for Link Multiplexing 


One of the system level objectives for ALF was to use the native multiplexing mode of the network, 
for example packet or cell switching, in an efficient manner. However, one could ask the question 
in another way and inquire, once the application has broken the data into ADUs, what the most 
efficient network multiplexing mode would be. Cell switching offers the advantage of very low latency 
in admitting high priority data into the network, since the network can be reallocated at the end 
of every cell. The penalty for this is the cost and complexity in the switch controller of making a 
scheduling decision for every cell. Since cells are small, the processing rate is high. One could make 
the data units larger, as in a classical packet switch, and thus reduce the header processing rate, 
but the longer latencies waiting for a packet to complete transmission may cause undesirable delays 
for high priority traffic. 


Chris Lefelhocz, in an MS thesis, explored a new scheme which avoids this tradeoff by constructing
a network based on pre-emption. The basic mode of operation is that an entire ADU is scheduled
for transmission over a network link, but the ADU can be pre-empted as needed if a higher priority
ADU arrives. The required processing rate in the switch controller is thus reduced; it corresponds
to ADU arrivals rather than cell or packet arrivals. At the same time, the latency to forward a
high-priority ADU is minimized. We hypothesized that this architecture might provide the best of
both previous approaches: latency better than even the low latency of the cell architecture, with
processing demands lower than those of the packet architecture.
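
The pre-emptive mode of operation can be illustrated with a toy single-link model. The Python
sketch below is our own simplification and is not the simulator described in this section: time is
measured in byte-times on the link, a lower priority number means more urgent, a scheduling
decision is made only when an ADU arrives or completes, and pre-emption itself is assumed to cost
nothing. The function name and tuple layout are hypothetical.

```python
import heapq


def simulate_preemptive_link(adus):
    """Return {name: completion_time} for ADUs sharing one pre-emptive link.

    Each ADU is (arrival_time, size_bytes, priority, name); times are in
    byte-times and a lower priority number is more urgent.
    """
    pending = sorted(adus)             # ordered by arrival time
    ready = []                         # heap of (priority, arrival, name, remaining)
    done = {}
    now = 0
    i = 0
    while i < len(pending) or ready:
        if not ready:
            now = max(now, pending[i][0])        # link idle until next arrival
        while i < len(pending) and pending[i][0] <= now:
            arrival, size, priority, name = pending[i]
            heapq.heappush(ready, (priority, arrival, name, size))
            i += 1
        priority, arrival, name, remaining = heapq.heappop(ready)
        # Transmit until this ADU finishes or the next arrival forces a decision.
        next_arrival = pending[i][0] if i < len(pending) else float("inf")
        run = min(remaining, next_arrival - now)
        now += run
        remaining -= run
        if remaining == 0:
            done[name] = now
        else:
            heapq.heappush(ready, (priority, arrival, name, remaining))
    return done


# A 1000-byte bulk ADU starts at t=0; a 100-byte urgent ADU arrives at t=200.
result = simulate_preemptive_link([(0, 1000, 1, "bulk"), (200, 100, 0, "urgent")])
print(result)
```

In this example the urgent ADU completes at time 300 rather than waiting for the bulk ADU to
finish at time 1000, while the controller makes a decision only at the two arrivals and two
completions instead of once per cell.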


We developed a detailed simulation of such a switch, and explored its operation with a variety of 
traffic loads and switch configurations. In the abstract, the basic scheme seems to offer the expected 
benefits. With a realistic set of estimates for ADU sizes, a system with less than one tenth the 
processing power of a cell switch seems to offer effective operation with the same latencies. The 
detailed results are presented in LCS-TR-621 (in publication). 


In practice a new scheme such as this is not likely to completely displace either the classical packet 
switching schemes or the rapidly advancing ATM schemes. The results of this study, in addition to 
suggesting a new type of link-level technology, can be extrapolated to show how an ATM switch, 
for example, might be implemented more efficiently. If the default behavior of an ATM switch was 
to send a continuous flow of cells from one VC, unless a cell from a flow of higher priority were to 
arrive, this algorithm might reduce the expected processing cost for each cell. 


The simulations described above identified one very important issue, which applies generally to 
packet, cell and pre-emption systems. The issue is the phenomenon which we have seen and de- 
scribed as “ack compression” in packet systems [20]. In a pre-emption system, if data packets and 
acknowledgments are at the same priority level, then a number of acks can accumulate behind a large
ADU, leading to a burst of acks being later delivered to an end-node, which in turn results in a burst
of subsequent data packets. One solution to this is to assign acks a higher priority, which solves the 
problem, both in the packet and the pre-emption case. However, this solution requires the switch 
to distinguish acks from data packets, which is not consistent with most layering architectures. The 
alternative of giving all small packets a higher priority leads to other anomalies, in which users 
can obtain preferential treatment by using the network in very inappropriate ways. (Small packets 
represent a very inefficient use of the network.) 


3.8 Development of Workstations Environment 


As part of our development of a software environment suitable for forwarder and end-node
networking research, we undertook a number of development projects to obtain a useful operating
system for a current generation workstation. We imported an implementation of Mach, and explored its use
in this context, but concluded that its internal structure did not naturally lead to the most efficient 
and straightforward implementation of network code. Our current target for an open system (free of 
license restrictions) is one of the variants of the openly available Unix BSD 4.4 derivatives. We have 
worked with the community to support these packages, and have imported and incorporated into 
our computing environment such open systems on Digital’s Alpha-based workstations, Intel-based 
PC’s and Sun workstations. This hardware independence allows us both to adopt new technology 
as it appears and to compare the effects of different hardware designs on the performance of highly 
tuned scheduling and forwarding algorithms. 


Another of our projects involved the importation and evaluation of the x-kernel, a framework
for the implementation of network protocols done at the University of Arizona. We performed a 
number of projects with the x-kernel. We implemented an IP forwarder and a version of our queue 
management for QoS to evaluate the architectural features and the performance. We identified no 
architectural problems inserting these features into the system, but, as noted above, the cost of the 
x-kernel task abstraction may be somewhat too high to permit the system to be used as a basis for 
a specialized situation such as a packet forwarder. 


3.9 FLORB 


As part of developing our prototype platform for implementing forwarders and related software, we 
developed a network interface called a Florb (a contraction of Flow to Orbit interface). The interface 
was first designed to interface to the IBM Planet/Orbit technology, but in fact is a very general 32 
bit high speed interface suited to a wide variety of network devices. Our prototyping platform at 
the time was the DEC workstation, and the Florb interfaces to the TURBOchannel interface for that
family of machines. 


A block diagram of the Florb is shown in Figure 2. The novel aspect of the Florb is its architecture 
for moving data, which we believe is suited for a wide variety of high speed applications. At the 
center of the Florb is a CPU, fast SRAM, and high speed data bus, which connects all the relevant 
inputs and outputs of the interface card. At the periphery of the card are a number of retiming 
FIFO’s which allow the various interfaces to operate with some degree of asynchronicity. Data 
movement between the FIFOs and connected devices (the host I/O bus, the network or any other 
device) may utilize a DMA controller or an external FIFO clock, as required. The data is moved 
from external devices into FIFOs, and the FIFOs are in turn connected to this central bus. 


Transfers of data across the bus are controlled by a RISC processor (in our implementation, an Intel 
80960). However, the way the processor controls the bus is somewhat unusual. Rather than issuing 
read and write instructions to move data to and from the various FIFOs, individual address lines 
of the processor are wired directly to the read and write ports of the FIFOs. Thus, to read from a
particular FIFO, the CPU simply issues any instruction which loads the external data bus into its
onboard registers, with the corresponding address bit set.


[Figure 2: Block Diagram of FLORB Host-Network Interface. Labels: Interface to Host, Local Memory, Interface to Network.] 


One instruction, by setting several of its address lines, can cause a number of reads and writes to 
occur simultaneously. Multiple reads in general would represent an error, but one read and multiple 
writes has the effect of putting a value on the bus and taking it off at a number of other points. 
Thus, in one instruction, one could read a byte from the host, write it to the network, and wrap it 
back to the host. 


The processor is thus capable of being used in two modes, one a normal RISC processor, and the 
other a micro-sequencer. It can switch between these two modes on an instruction by instruction 
basis, based on which regions of the address space are referenced. 


The key to the performance of the FLORB is that each data transfer can be accomplished in one 
processor cycle. Our implementation of the Florb uses a modest 25 MHz clock, which implies a peak 
data transfer rate of 800 mb/s. More aggressive processors with higher external clock rates could 
drive a bus at even higher speeds. 


The performance of this design is achieved with great simplicity. The FLORB hardware consists 
of a single circuit board approximately 4 inches by 5 inches in size. An additional card of the same 
size may be mounted in a piggy-back fashion to hold the components required for a specific network 
interface, such as an [illegible] physical layer driver. 


We conducted a number of experiments with the FLORB. Our initial goal was to determine the 
validity of the concept and demonstrate its performance capabilities. This was accomplished by 
connecting two FLORBs back to back to form a point-to-point network link. To verify correct 
operation of the devices, we also developed device drivers for Digital's OSF/1 operating system 
which treated the devices as standard network interfaces. 


The performance initially achieved with this arrangement was lower than expected. We found that 
the link operated at a throughput of approximately 250 mb/s. At this speed, the FLORB's processor 
was operating at approximately 30 percent of full capacity. Further, the poor performance could 
not be explained by known limitations of the workstation hardware. 


To study this problem further, we characterized the performance of the OSF/1 network protocol 
implementation itself by implementing a "null" network interface which generated and accepted 
packets without actually touching hardware. Using this method, we discovered that the overhead of 
the OSF/1 IP protocol implementation was substantially higher than expected. 


We next eliminated the use of OSF/1 and drove the FLORB devices directly from our x-kernel based 
IP router. In this configuration we achieved an overall effective throughput of approximately 400 
mb/s for a mix of IP traffic created by tcplib [6], a library which artificially synthesizes network 
traffic based on statistics from real networks. At this processing rate, the FLORBs' CPUs were 
operating at approximately 65 percent of capacity. 


To determine the bottleneck limiting performance to this figure, we profiled our forwarder’s operation 
using the Alpha’s hardware cycle counters. We learned from this experiment that our performance 
was limited by I/O bus arbitration and memory contention effects within the workstation, and could 
not be further improved. 


We conducted one further experiment with the FLORB in this configuration, varying the traffic mix 
to cover a range of packet sizes. We determined that as we moved to smaller packets the per-packet 
handling overhead never became high enough to exceed the capacity of the FLORB’s CPU, although 
the overall throughput of the system fell significantly with very small packets. 


The development of the FLORB demonstrated two points. One was that the control interface 
to an I/O device need not be awkward and slow. As noted above, processing overheads in packet 
forwarders are not typically dominated by IP level processing steps, but by the overhead of controlling 
the associated I/O devices. The FLORB was designed to demonstrate that device control can be 
efficient. In our tests, the efficiency of the overall device handler was limited by the workstation’s 
bus arbitration, showing that the FLORB design itself was not a limiting factor. 


The second point it proved is that a general purpose RISC chip can be the basis of a very high 
performance network interface. The FLORB incorporated an Intel 960, but used this processor in 
a novel way. There has been a continuing debate as to whether it is worth designing and using 
special chips for I/O control. Such a chip can have special functions on it, which could conceivably 
lead to higher performance. On the other hand, because it is a more specialized product with less 
anticipated total demand, a chip vendor cannot afford to put as much effort into such a chip as into 
the next generation RISC processor. So by the time the special I/O control chip comes out, the RISC 
chips then appearing can go just as fast. Our design, by showing a creative way to exploit a RISC chip 
for micro-control, lends support to the thesis that special control chips are not worth the effort to 
build. 


4 Lessons Learned 


4.1 Performance and Architecture of Packet Forwarding 


We observed that in a packet forwarder (a router or flow forwarder) the device driver and queue 
management algorithms constitute the major source of overhead. This is a critical observation, 
because it suggests that there are limited opportunities for speeding up a packet forwarder by 
parallel processing. The device driver seems to be intrinsically serial. It involves device registers 
and packet queues that must be updated atomically, and per-packet operations that must be 
done at each arrival and departure time. Since the network hardware sends packets in a serial 
manner, the resulting timing of the device code is serial. One can use a separate processor for each 
device, and one or more for the IP processing, but one cannot speed up the processing by massive 
parallel execution of the code. Given that the cost of IP is already not the dominating cost, making 
it parallel will not be strongly effective. 
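

The argument can be made concrete with Amdahl's law. The sketch below is illustrative only: the 30 percent share assigned to IP processing is an assumed figure, not a measurement from our experiments, and the function name is hypothetical.

```python
def forwarder_speedup(ip_fraction, processors):
    """Amdahl's-law speedup when only the IP share of per-packet work is parallelized.

    ip_fraction: assumed share of per-packet time spent in IP processing.
    The remainder (device driver and queue management) is treated as serial.
    """
    serial = 1.0 - ip_fraction
    return 1.0 / (serial + ip_fraction / processors)


# Assumed split for illustration: 30% IP processing, 70% serial driver and queue work.
for n in (1, 2, 8, 64):
    print(n, round(forwarder_speedup(0.30, n), 2))
```

Under this assumed split the speedup can never exceed 1/0.7, or about 1.43, however many processors are applied to the IP code.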


This line of reasoning suggests the following approach to forwarder design. Instead of a focus on 
silicon for the forwarding code (e.g. IP in hardware) there should be a focus on very efficient 
hardware for network interfaces, and for key processing steps in the queue management, for example 
the cell sequencer chip from Bellcore described elsewhere in this report. Even hardware support for 
the allocation and freeing of buffers can have a significant impact on the processing overhead. Our 
overall conclusion is thus that while a massively parallel forwarder is not an effective approach, 
selective hardware support combined with a state of the art RISC processor is an effective way to 
build a high performance router. This is a very desirable conclusion, since we argue very strongly 
that to support flexible evolution of the network services, it is critical that the parts of the algorithm 
that reflect the details of the offered service be coded in a flexible manner, for example in software, 
so that the packet forwarder can evolve as the service definition does. 


There is an evolutionary path which forwarders and switches will follow, both in the current product 
offerings and in the higher speed context. The first generation packet forwarders — gateways or 
routers — combined in one element the format conversion implied in moving from one network type 
to another and the cross-connect capability of switching packets between a multiplicity of interfaces. 
This was reasonable, as a very cost effective way to build a router at current mainstream product 
speeds is to put the packets into a central shared memory buffer. However, at higher speeds, memory 
is not as effective a cross-connect architecture as parallel switch fabric designs, because the single 
shared communication point (the shared memory) must operate at very high speeds. However, with 
a switch fabric at the core of the switch, it is difficult to utilize a single processor to perform the 
forwarding decisions necessary at all ports of the switch. The architecture thus more naturally 
becomes a separate forwarding processor for each switch port. 


In this more evolved architecture, the format conversion is performed in each of the switch port 
controllers, and the traditional router becomes a two-port device connecting the line to the switch 
port. This simplification of the router architecture makes it much easier to build a cost-effective high 
performance router element. Unfortunately it also introduces some limitations, in that it becomes 
much harder to implement global resource management algorithms. In the shared-memory switch 
these algorithms can be implemented on a single processor, while in processor-per-port design they 
must be implemented in a distributed manner. Managing this and other tradeoffs implicit in the 
processor-per-port design is a topic requiring further exploration. 


4.2 The Fundamental Limit of Memory Bandwidth 


One fact that came up repeatedly in this project is that there were widely varying expectations among 
members of the project as to how fast a computer of a given design would be able to deal with data 
coming across a network. A number of possible performance limits were postulated as the real 
bottleneck, including the design or implementation of the network protocols. What we concluded, 
based on a number of experiments and calibrations, is that in most cases, the real determinant of 
achieved throughput is the bandwidth of the computer memory, and the effectiveness of the I/O 
connection into that memory. 


In general, the bandwidth of the memory must be at least twice, and more practically, four times the 
speed of the desired overall throughput. Looking at arriving data (departing data is similar) there 
are at least two memory cycles required to move the data from the network: the network controller 
must write the data to memory, and the application code must pick it up. Protocol processing may 
add perhaps another two cycles, to move the data between system and user buffers and to compute 
checksums. Creative design can eliminate one or more of these cycles, but the memory bandwidth 
still remains the basic limit to achieved throughput. 
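

The rule of thumb above amounts to dividing the memory bandwidth by the number of times each byte crosses the memory. A minimal sketch follows; the function names are hypothetical and the 1000 mb/s target is chosen only to match the gigabit objective.

```python
def max_throughput_mbps(memory_bandwidth_mbps, passes_per_byte):
    """Upper bound on network throughput when each byte crosses memory several times."""
    if passes_per_byte < 1:
        raise ValueError("data must cross memory at least once")
    return memory_bandwidth_mbps / passes_per_byte


def required_memory_bandwidth_mbps(target_mbps, passes_per_byte):
    """Memory bandwidth needed to sustain a target network throughput."""
    return target_mbps * passes_per_byte


# Two passes: the controller writes the data, the application reads it.
minimal = required_memory_bandwidth_mbps(1000, 2)
# Four passes: add a copy between system and user buffers and a checksum pass.
practical = required_memory_bandwidth_mbps(1000, 4)
```

With these inputs the required memory bandwidth is 2000 mb/s in the minimal case and 4000 mb/s in the more practical one.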


On the other hand, with careful design of system software, these limits can actually be reached. 
While overhead in the system and the protocols can be a practical cause of lost performance, these 
issues can be sidestepped if the effort is justified. 


This balance of limitations will continue to hold into the future. We do not anticipate a reversal of 
the current situation, such that processing overheads become more of a bottleneck and memory less. 
Current trends in silicon design are making processors continuously faster, while memory chips are 
getting bigger, rather than faster. The speed of a memory architecture can, of course, be increased 
by use of wider memory busses, cache architectures, and so on. But these approaches add more cost 
than a faster processor, and thus we can expect, for future machines, that as memory speeds are 
increased, the processor speeds will increase to match. 


This analysis does not address one key issue, which is the potential for a mis-design in the way the 
network is connected into the memory. Most memory designs today focus on providing a high speed 
transfer path between the memory and the processor. This objective can, in simple cases, lead to 
designs in which network interfaces cannot interact with the memory at highest speeds. Both we 
and other members of the project spent a substantial amount of time dealing with this issue as a 
practical impediment to progress. However, we conclude that this issue is not fundamental, and 
could be resolved as an engineering issue in the design of future generations of workstations if the 
objective is warranted. 


4.3 The Flexibility of Software 


One means to increase speed is to cast the critical processes in hardware, in the hope of making 
them go faster, or of going faster at a lower cost. While this approach is technically feasible, it can 
essentially lead to a disaster if the desire for speed is not balanced by the need for flexibility. 


ATM is a good example of this tradeoff. ATM cells are small, and must thus be processed at 
high rates in order to service a high speed link. At 622 mb/s, ATM cells arrive at over a million 
per second. The concept, when ATM was first proposed, was that the actual processing of the cells 
would be a very simple act that could thus be performed in hardware. Only the higher level control 
functions, such as opening connections, would need to be performed in software. 
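

The cell rate quoted above follows directly from the cell size. The sketch below uses the nominal 622 mb/s line rate and ignores transmission framing overhead, so the result is a slight overestimate; the function name is hypothetical.

```python
CELL_BYTES = 53  # standard ATM cell: 5-byte header plus 48-byte payload


def cells_per_second(link_bps):
    """Cell arrival rate on a link that carries back-to-back ATM cells."""
    return link_bps / (CELL_BYTES * 8)


rate = cells_per_second(622e6)   # roughly 1.47 million cells per second
budget_ns = 1e9 / rate           # roughly 680 ns available to process each cell
```

The per-cell time budget is the figure that motivated the expectation that cell processing would have to be done in hardware.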


This assumption has not been totally justified so far, and has led to some of the growing pains of 
ATM. For example, some of the QoS work done here cannot be implemented without changing the 
actual cell processing in the switch. If that process is cast in hardware, there is no way to upgrade 
an existing ATM switch to support explicit QoS. In contrast, in packet switches that perform these 
steps in software, this migration is just a new software load. 


Data networks have evolved rapidly over the past several years, in part because the field is young 
and new approaches emerge, and in part because customer expectations for service evolve, as for 
example the desire for multi-media networking. This evolution calls for products with flexibility. 
Part of our research was to show that high speed packet processing could be done in software. Our 
conclusion is that indeed, packet switches that employ general purpose processors for forwarding 
are reasonable. Our research suggested two factors that can contribute to a high performance, cost 
effective network device, whether packet switch or host interface. The first is to engineer the actual 
network interface hardware so that it does not represent a high overhead to control. Some current 
controller chips require as many instructions to control as the packet processing computation. 


The second point is less obvious. There are indeed simple, basic building block operations in the 
forwarding of a packet that could be cast in silicon, and would result in increased performance at 
reduced cost. However, there may not yet be enough understanding to let us say with confidence 
what these modules are, and getting the design wrong can have a disastrous effect on the final 
flexibility and performance of the resulting device. 


Perhaps the simplest building block that could be cast in hardware is allocation and freeing 
of buffers. Hardware support for a free list of buffers would represent a useful component. The 
Bellcore sequencer chip represents a more ambitious step in this direction, in that it implements 
a list of buffers sorted by hardware according to departure time. This chip allowed us to realize 
an ATM output port controller that could implement a WFQ cell scheduler at OC-3 rates, and 
indeed which could approximate a hierarchical WFQ scheme at essentially this speed. However, the 
interface to the chip, in retrospect, could itself have been lower overhead, due to issues of dealing in 
software with the possibility of field overflow. Simple issues such as this must be resolved as ideas 
for hardware building blocks mature. 
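

The two building blocks discussed above can be sketched in software. This is an analogy only, not a description of the Bellcore sequencer chip or of our port controller: the class and function names are hypothetical, a binary heap stands in for the hardware sort, and the tag computation is a simplified WFQ finish time that holds virtual time at zero, which is only valid while all flows remain backlogged.

```python
import heapq
from itertools import count


class BufferPool:
    """Free list of buffer indices: allocate and release in constant time."""

    def __init__(self, n):
        self.free = list(range(n))

    def alloc(self):
        return self.free.pop() if self.free else None

    def release(self, buf):
        self.free.append(buf)


class DepartureQueue:
    """Buffers kept sorted by departure time; ties leave in arrival order."""

    def __init__(self):
        self._heap = []
        self._seq = count()

    def __len__(self):
        return len(self._heap)

    def insert(self, departure_time, item):
        heapq.heappush(self._heap, (departure_time, next(self._seq), item))

    def pop_earliest(self):
        departure_time, _, item = heapq.heappop(self._heap)
        return departure_time, item


def wfq_tag(last_finish, flow, weight, size=1.0, virtual_now=0.0):
    """Simplified finish tag: start after the flow's previous cell, advance by size/weight."""
    start = max(virtual_now, last_finish.get(flow, 0.0))
    last_finish[flow] = start + size / weight
    return last_finish[flow]


pool = BufferPool(8)
queue = DepartureQueue()
last_finish = {}
weights = {"A": 2.0, "B": 1.0}  # assumed weights: flow A gets twice the share of flow B

for flow in ["A", "B", "A", "B", "A", "B"]:
    buf = pool.alloc()
    queue.insert(wfq_tag(last_finish, flow, weights[flow]), (flow, buf))

order = []
while len(queue):
    _, (flow, buf) = queue.pop_earliest()
    order.append(flow)
    pool.release(buf)
```

With these weights the cells depart in the order A, B, A, A, B, B, so flow A drains its backlog twice as fast as flow B.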


4.4 Speed May Not Be The Hardest Problem 


We concluded that the issue of speed alone represented only the obvious problem of efficient imple- 
mentation and adequate bandwidth. What we showed in our research is that it is possible to build 
devices that perform well enough to hit the fundamental limits of memory and link bandwidth. We 
did not see any issues that suggested that this would change in the short term future. The more 
complex issue of high delay-bandwidth product paths may represent more significant problems, as 
issues of flow and congestion control need to be rethought. This matter is not a simple one of speed, 
however, but an issue of speed coupled with scale. It seems quite practical to implement a transport 
protocol such as TCP on a suitable hardware platform and expect it to go at a gigabit. The interest- 
ing question is what will happen when many copies of this implementation contend for bandwidth 
over a shared link, and deal with issues of congestion. These testbeds were not large enough to 
reveal any of these issues, which for the moment must be dealt with through the somewhat artificial 
means of simulation. 
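

The scale of the delay-bandwidth problem can be seen from a simple calculation. The 30 ms round trip time below is an assumed value chosen for illustration, not a measurement from the testbed, and the function names are hypothetical. The 65535 byte figure is the largest window expressible in TCP's 16-bit window field without window scaling.

```python
def bytes_in_flight(bandwidth_bps, rtt_seconds):
    """Data that must be outstanding to keep a path of this delay full."""
    return bandwidth_bps * rtt_seconds / 8


def window_limited_bps(window_bytes, rtt_seconds):
    """Throughput ceiling imposed by a fixed window on a path of this delay."""
    return window_bytes * 8 / rtt_seconds


RTT = 0.030  # assumed 30 ms round trip time
needed = bytes_in_flight(1e9, RTT)        # about 3.75 million bytes at a gigabit
ceiling = window_limited_bps(65535, RTT)  # about 17.5 mb/s with an unscaled window
```

The gap between the two figures indicates why flow control, rather than raw processing speed, is the issue on such paths.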


4.5 The Importance of a Suitable Experimental Platform 


Especially at high speeds, attention to detail and careful design are required for success. It takes 
only one throughput bottleneck to limit the overall experiment. This situation meant that it 
was necessary, in almost all of the experimental situations, to tune all parts of the system for good 
performance, and to understand what the root causes were of apparent performance problems. 


This is work that, while highly necessary, is not directly related to the research question at hand. 
It is thus important to structure the project such that the difficulty of this work does not come to 
dominate the whole project. We identified two success factors in this regard. First, the experimental 
platforms should be shared among as many of the participants as possible, so that experience could 
be shared. Second, proprietary aspects of the experimental apparatus should be minimized, so that 
time is not wasted trying to deduce how a hidden part of the system is influencing the overall 
performance. 


5 Conclusions 


The gigabit objective, in retrospect, should not be thought of as an actual numerical goal to which 
performance should be pushed. While it may not have been obvious at the beginning of the project, 
the gigabit objective will be achieved in due course as processors and related devices become faster. 
Rather, the objective of the project was properly to push performance forward to the extent possible, 
both to understand what the fundamental limits were and to excite people to the prospects of 
achieving higher performance in practice. Other parts of the testbeds looked at different issues, 
such as speculation on how applications might change if these bandwidths were available. What 
we concluded is that we can indeed build systems with high performance, and that we understand 
what the fundamental limits to performance are, and how to reach those limits. Those limits are 
not absolute numbers, but derive from the capabilities of key system components, most particularly 
memory bandwidth. When we have a workstation with a memory bandwidth somewhere between 
two and four gigabits per second, we will have a workstation that can be networked at a gigabit. 
Depending on how the I/O architecture is constructed, we are not far away from that objective 
today. 